Preface



This book aims to capture recent advances in motion compensation for
efficient video compression. It investigates linearly combined
motion-compensated signals and generalizes the well-known superposition for
bidirectional prediction in B-pictures. The number of superimposed signals and
the selection of reference pictures are important aspects of the discussion.

The application-oriented part of the book applies this concept to the
well-known ITU-T Recommendation H.263 and continues with the improvements
by superimposed motion-compensated signals for the emerging ITU-T
Recommendation H.264 and ISO/IEC MPEG-4 (Part 10). In addition, it discusses
a new approach for wavelet-based video coding. This technology is currently
being investigated by MPEG to develop a new video compression standard for the
mid-term future.

The theoretical part of the book provides a deeper understanding of the
underlying principles of superimposed motion-compensated signals. The text
incorporates more than 200 references, summarizes relevant prior work, and
develops a mathematical characterization of superimposed motion-compensated
signals. The derived information-theoretic performance bounds permit a
valuable comparison of the investigated compression schemes.
Chapter 1



INTRODUCTION 



Motion-compensated prediction is an important component of current hy- 
brid video coding systems. In recent years, advances in compression efficiency 
have mainly been achieved by improved motion-compensated prediction, e.g. 
sub-pel accurate motion compensation [1], variable block size prediction [2], 
multiframe prediction [3-5], and multihypothesis motion compensation. 

Multihypothesis motion-compensated prediction linearly combines multi- 
ple motion-compensated signals to arrive at the actual prediction signal. This 
term was first used in [6] to provide a framework for overlapped block motion 
compensation (OBMC). OBMC was introduced to reduce blocking artifacts 
in motion-compensated prediction [7]. In earlier work, attempts were
made to combine signals from more than one frame. Published in 1985, [8]
investigates adaptive predictors for hybrid coding that use up to four previous
fields. In the same year, the efficiency of bidirectional prediction was
addressed in [9]. To predict the current frame, bidirectional prediction uses a
linear combination of two motion-compensated signals: one is chosen from the
next reference frame, the other from the previous reference frame.
Bidirectional prediction characterizes the now well-known concept of
B-pictures, which was originally proposed to MPEG [10]. The motivation was to
interpolate any skipped frame taking into account the movement between the two
“end” frames. The technique, originally called conditional motion-compensated
interpolation, coupled the motion-compensated interpolation strategy with
transmission of significant interpolation errors.

These practical schemes have been studied in [11] and summarized in a
theoretical analysis of multihypothesis motion-compensated prediction. The
analysis is based on a power spectral density model for inaccurate motion
compensation [12] which has proven successful in characterizing
motion-compensated prediction. Variations of these practical schemes have also
been standardized in, e.g., [13] and [14].

With the advent of multiframe prediction, the question of linearly
combined prediction signals has been revisited. [15, 16] design general
predictors for block-based superimposed motion compensation that utilize
several previous reference frames. To determine an efficient set of
motion-compensated blocks, an iterative algorithm successively improves
conditionally optimal motion-compensated blocks. A similar algorithm has been
proposed in [17] for bidirectional prediction only.

Multiple frames are not only used for predictive coding. Schemes for three- 
dimensional subband coding of video signals consider also multiple frames 
[18, 19]. Adaptive wavelet transforms with motion compensation can be used 
for temporal subband decomposition [20]. These schemes use again linear 
combinations of motion-compensated signals and are also of interest for our 
investigations. 

This book, “Video Coding with Superimposed Motion-Compensated Signals”,
contributes to the field of motion-compensated video coding as follows:

1 For video compression, we investigate the efficiency of block-based
superimposed prediction with multiframe motion compensation based on
ITU-T Rec. H.263. We explore the efficient number of superimposed
prediction signals, the impact of variable block size compensation, and the
influence of the size of the multiple reference frame buffer if the reference
frames are previous frames only.

2 We generalize B-pictures for the emerging ITU-T Rec. H.264 to the generic
concept of superimposed prediction, which chooses motion-compensated
blocks from an arbitrary set of reference pictures, and we measure the
improvement in compression efficiency. Further, the generic concept of
superimposed prediction also allows generalized B-pictures to be used as
references to predict other B-pictures. As this is not the case for classic
B-pictures, we explore the efficiency of this aspect too.

3 We build on the theory of multihypothesis motion-compensated prediction
for video coding and extend it to motion-compensated prediction with
complementary hypotheses. We assume that the displacement errors of the
multiple hypotheses are jointly distributed and, in particular, correlated. We
investigate the efficiency of superimposed motion compensation as a
function of the displacement error correlation coefficient. We conclude that
compensation with complementary hypotheses results in maximally
negatively correlated displacement errors. We continue and determine a
high-rate approximation for the rate difference with respect to optimal
intra-frame encoding and compare the results to [11]. To capture the influence
of multiframe motion compensation, we model it by forward-adaptive hypothesis
switching and show that switching among M hypotheses with statistically
independent displacement errors reduces the displacement error variance by
up to a factor of M.

4 We consider not only predictive coding with motion compensation. We
also explore the combination of linear transforms and motion compensation
for a temporal subband decomposition of video and discuss motion
compensation for groups of pictures. To this end, we investigate, both
experimentally and theoretically, motion-compensated lifted wavelets for
three-dimensional subband coding of video. The experiments capture the coding
efficiency as a function of the number of pictures in the group and permit
a comparison to predictive coding with motion compensation. The
theoretical discussion analyzes the investigated lifted wavelets and builds on
the insights from motion compensation with multiple hypotheses. Further,
the analysis provides performance bounds for three-dimensional subband
coding with motion compensation and gives insight into potential coding
gains.

This book is organized as follows: Chapter 2 provides the background for 
video coding with superimposed motion-compensated signals and discusses 
related work. Chapter 3 investigates motion-compensated prediction with 
complementary hypotheses. Based on ITU-T Rec. H.263, Chapter 4 explores 
experimental results for video coding with superimposed motion-compensated 
prediction and multiple reference frames. Chapter 5 discusses generalized
B-pictures for the emerging ITU-T Rec. H.264. Finally, Chapter 6 explores linear
transforms with motion compensation and their application to motion
compensation for groups of pictures.




Chapter 2 



BACKGROUND AND RELATED WORK 



2.1 Coding of Video Signals 

Standard video codecs like ITU-T Recommendation H.263 [14, 21] or the 
emerging ITU-T Recommendation H.264 [22] are hybrid video codecs. They 
incorporate an intra-frame codec and a motion-compensated predictor. The 
intra-frame codec is able to encode and decode one frame independently of 
others, whereas the motion-compensated predictor is able to compensate mo- 
tion between the current frame and a previously decoded frame. 




Figure 2.1. Hybrid video codec utilizing motion-compensated prediction. 

Fig. 2.1 depicts such a hybrid video codec. The encoder estimates the
motion between the current frame s and a previously decoded frame r and
transmits it as side information to the decoder. Both encoder and decoder use
motion compensation to generate the motion-compensated frame ŝ from
previously reconstructed frames which are also available at the decoder.
Therefore, only the frame difference e between the current frame and the
motion-compensated frame needs to be encoded by the intra-frame encoder. This
frame difference has much less signal energy than the current frame and, hence,
requires a lower bit-rate when encoded. Despite the side information, the
overall bit-rate of a hybrid video codec is less than the bit-rate of a video
codec with intra-frame coding only. Therefore, motion-compensated prediction
is an important component for efficient compression with a hybrid video codec.

Hybrid video codecs require sequential processing of video signals which 
makes it difficult to achieve efficient embedded representations of video se- 
quences. Therefore, we consider also motion-compensated three-dimensional 
subband coding of video signals [23-25]. Applying a linear transform in tem- 
poral direction of a video sequence may not be very efficient if significant 
motion is prevalent. Motion compensation between two frames is necessary 
to deal with the motion in a sequence. A combination of linear transform and 
motion compensation is required for efficient three-dimensional subband cod- 
ing. 

In the following, we review relevant techniques and principles for
state-of-the-art video coding. The discussion provides a background for the
following chapters and summarizes work on which we will build. Section 2.2
outlines several relevant methods for motion compensation: bidirectional
motion compensation, overlapped block motion compensation, variable block size
motion compensation, multiframe motion compensation, and superimposed motion
compensation. Section 2.3 discusses previous work on rate-constrained
motion estimation, rate-constrained estimation of superimposed motion,
quantizer selection at the residual encoder, and techniques for efficient
motion estimation. Section 2.4 introduces a theory for motion-compensated
prediction. We discuss the underlying frame signal model, review the model for
motion-compensated prediction, and outline the state of the art for
multihypothesis motion-compensated prediction. We reference the utilized
performance measures and summarize the results of this theory. Finally,
Section 2.5 summarizes previous work on three-dimensional subband coding of
video. We outline the problem of motion compensation for the temporal subband
decomposition and refer to adaptive lifting schemes that permit motion
compensation.

2.2 Motion Compensation 

The efficiency of inter-frame coding schemes for video sequences is
improved by motion compensation. Efficient motion compensation requires an
accurate measurement of the displacement field between two frames. A
practical algorithm for this measurement is block matching [26, 27]. It
estimates the displacement field on a block basis, i.e., it approximates each
block with one displacement value by minimizing its prediction error. Efficient
motion-compensated coding is desirable especially for low bit-rate video
applications [28]. Usually, block-based schemes assume just translatory motion
for all pels in the block, but more sophisticated schemes like spatial
transforms [29] and transformed block-based motion compensation [30] are
possible. By omitting the block constraint, arbitrarily shaped regions can be
used for motion compensation [31].
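
As a minimal sketch of block matching under the translatory assumption, the following Python code performs a full search over integer displacements for one block. The function name block_match, the squared-error criterion, the search range, and the toy data are assumptions made for illustration; they are not taken from [26, 27].

```python
import numpy as np

def block_match(cur, ref, top, left, size=8, search=4):
    """Full-search block matching for one block.

    Returns the integer displacement (dy, dx) within +/-search pels that
    minimizes the sum of squared differences, together with that error.
    """
    block = cur[top:top + size, left:left + size].astype(np.float64)
    best_mv, best_err = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if y < 0 or x < 0 or y + size > ref.shape[0] or x + size > ref.shape[1]:
                continue
            cand = ref[y:y + size, x:x + size].astype(np.float64)
            err = float(np.sum((block - cand) ** 2))
            if err < best_err:
                best_mv, best_err = (dy, dx), err
    return best_mv, best_err

# Toy data (assumed): the current frame is the reference frame shifted
# down by two pels and left by one pel.
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(32, 32))
cur = np.roll(ref, shift=(2, -1), axis=(0, 1))
mv, err = block_match(cur, ref, top=8, left=8)
```

For this toy shift, the block at (8, 8) in the current frame is found at displacement (-2, 1) in the reference frame with zero prediction error.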

Efficient inter-frame coding schemes consider also the problem of noise re- 
duction in image sequences. Dubois and Sabri describe in [32] a nonlinear 
temporal filtering algorithm using motion compensation. Woods et al. present 
a spatio-temporal adaptive 3-D Kalman filter with motion compensation [33] 
and couple it with motion estimation in [34]. 

2.2.1 Bidirectional Motion Compensation 

Frame skipping is a viable technique to drastically reduce the bit-rate
necessary to transmit a video signal. If all frames have to be reconstructed at
the decoder, skipped frames must be interpolated by a motion compensation 
scheme. Netravali and Robbins propose such a scheme in [35] and initiate 
further research in this field. Soryani and Clarke combine image segmenta- 
tion and adaptive frame interpolation [36], Thoma and Bierling consider cov- 
ered and uncovered background for motion-compensated interpolation [37], 
and Cafforio et al. discuss a pel-recursive algorithm [38]. Similar algorithms 
for adaptive frame interpolation are outlined in [39-42]. Kovačević et al. use
adaptive bidirectional interpolation for deinterlacing [43, 44]. Au et al. study 
also temporal frame interpolation [45] and compare block- and pixel-based in- 
terpolation in [46]. They propose temporal interpolation with overlapping [47] 
and unidirectional motion-compensated temporal interpolation [48]. They also 
suggest zonal based algorithms for temporal interpolation [49]. These algo- 
rithms allow efficient block-based motion estimation [50, 51]. 

To improve the quality of the interpolated frames at the decoder, the inter- 
polation error is encoded and additional bits are transmitted to the decoder. 
Micke first considers the idea of interpolation error encoding in [52]. Haskell
and Puri [53] as well as Yonemitsu and Andrews [54] propose algorithms for 
this approach. This type of coded picture is also called B-picture. Puri et al. 
investigate several aspects of this picture type, like quantization and tempo- 
ral resolution scalability [55-58]. Additionally, Shanableh and Ghanbari point 
out the importance of B-pictures in video streaming [59].

Lynch shows how to derive B-picture motion vectors from neighboring
P-pictures [60]. But the displacement field of the B-pictures can also be encoded
and transmitted to the decoder. For example, Woods et al. suggest compactly 
encoded optical-flow fields and label fields for frame interpolation and bidirec- 
tional prediction [61]. 

B-pictures, as they are employed in MPEG-1 [62], MPEG-2 [13], or H.263 
[63], utilize Bidirectional Motion Compensation. Bidirectional
motion-compensated prediction is an example of multihypothesis motion-compensated
prediction where two motion-compensated signals are superimposed to reduce 
the bit-rate of a video codec. But the concept of B-pictures has to deal with 
a significant drawback: prediction uses the reference pictures before and after 
the B-pictures as depicted in Fig. 2.2. The associated delay of several frames 
may be unacceptable for interactive applications. In addition, the two motion 
vectors always point to the previous and subsequent frames and the advantage 
of a variable picture reference cannot be exploited. 




Figure 2.2. Bidirectional motion compensation. Reference pictures are used only before and 
after the current frame. 

The efficiency of forward and backward prediction was already addressed
by Musmann et al. [9] in 1985. The now well-known concept of B-pictures was
proposed to MPEG by Puri et al. [10]. The motivation was to interpolate 
any skipped frame taking into account the movement between the two “end” 
frames. The technique, called conditional motion-compensated interpolation, 
couples the motion-compensated interpolation strategy with transmission of 
the significant interpolation errors. 

For joint estimation of forward and backward motion vectors in bidirec- 
tional prediction, a low-complexity iterative algorithm is introduced in [64]. 
Starting from the initial values obtained by a commonly used block matching
independent search method, the motion vectors are iteratively refined until
a locally optimal solution to the motion estimation problem for bidirectional 
prediction is achieved. Each iteration consists of a series of two similar pro- 
cedures. First, the backward motion vector is fixed and a new forward motion 
vector is searched to minimize the prediction error. Then the forward motion 
vector is fixed and the backward motion vector is similarly refined by
minimizing the prediction error. This process is repeated until the prediction
error no longer decreases. The iterative search procedure minimizes the prediction
error and considers no rate constraint [17]. The price paid for the improvement 
in performance is only a small increase in computational complexity relative 
to independent search for the two motion vectors. Experiments in [17] show 
that the increase in search complexity is, on average, less than 20% of that of 
the independent search. Based on this work, [65] proposes an efficient motion 
estimation algorithm. 
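
The alternating refinement described above can be sketched as follows. This is an illustrative reading of the procedure, not the implementation of [17] or [64]: the helper names, the integer-pel full search, the plain averaging of the two signals, and the toy frames are all assumptions.

```python
import numpy as np

def get_block(frame, top, left, mv, size):
    """Return the displaced block, or None if it leaves the frame."""
    y, x = top + mv[0], left + mv[1]
    if y < 0 or x < 0 or y + size > frame.shape[0] or x + size > frame.shape[1]:
        return None
    return frame[y:y + size, x:x + size].astype(np.float64)

def refine(target, frame, top, left, size, search, other):
    """Best vector into `frame` while the other hypothesis block is fixed.

    With other=None this is the ordinary independent search.
    """
    best_mv, best_err = None, np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            blk = get_block(frame, top, left, (dy, dx), size)
            if blk is None:
                continue
            pred = blk if other is None else 0.5 * (blk + other)
            err = float(np.sum((target - pred) ** 2))
            if err < best_err:
                best_mv, best_err = (dy, dx), err
    return best_mv, best_err

def bidirectional_search(cur, prev, nxt, top, left, size=8, search=3):
    target = cur[top:top + size, left:left + size].astype(np.float64)
    # Initial values from two independent searches.
    fwd, _ = refine(target, prev, top, left, size, search, None)
    bwd, _ = refine(target, nxt, top, left, size, search, None)
    pred = 0.5 * (get_block(prev, top, left, fwd, size)
                  + get_block(nxt, top, left, bwd, size))
    err = float(np.sum((target - pred) ** 2))
    history = [err]
    while True:
        # Fix the backward vector and refine the forward vector ...
        fwd_new, _ = refine(target, prev, top, left, size, search,
                            get_block(nxt, top, left, bwd, size))
        # ... then fix the forward vector and refine the backward vector.
        bwd_new, err_new = refine(target, nxt, top, left, size, search,
                                  get_block(prev, top, left, fwd_new, size))
        if err_new >= err:  # the prediction error no longer decreases
            break
        fwd, bwd, err = fwd_new, bwd_new, err_new
        history.append(err)
    return fwd, bwd, history

# Toy frames (assumed): noisy copies of the current frame, shifted by one
# pel in opposite vertical directions.
rng = np.random.default_rng(1)
cur = rng.normal(size=(24, 24))
prev = np.roll(cur, (1, 0), axis=(0, 1)) + 0.3 * rng.normal(size=cur.shape)
nxt = np.roll(cur, (-1, 0), axis=(0, 1)) + 0.3 * rng.normal(size=cur.shape)
fwd, bwd, history = bidirectional_search(cur, prev, nxt, top=8, left=8)
```

Because each conditional search includes the current vector among its candidates, the joint error recorded in history can only decrease from one iteration to the next, which is why the loop terminates at a locally optimal pair.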

2.2.2 Overlapped Block Motion Compensation 

Like bidirectional prediction, Overlapped Block Motion Compensation
(OBMC) [6, 7, 66-68] is another example of the general concept of
multihypothesis motion-compensated prediction. Originally, the motivation was to
reduce blocking artifacts caused by block motion compensation. Sullivan and
Baker introduced motion compensation using control grid interpolation [69],
and Watanabe and Singhal introduced windowed motion compensation [70]. OBMC uses
more than one motion vector for predicting the same pixel but, in contrast to
bidirectional prediction, does not increase the number of vectors per block.

The discussion of overlapped compensation in [6] is based on
‘multi-hypothesis expectation’. The paper argues that only one block motion
vector is encoded for a large block of pixels and that the vector value is
limited in precision, such that an encoded block motion vector may not be
correct for all pixels in the block. For this reason, [6] proposes a
multi-hypothesis expectation paradigm. Since the decoder cannot know the
correct motion vector for each pixel, the motion uncertainty is modeled by an
a posteriori inferred displacement probability density function, conditioned
on the encoded data. Using this distribution, an ideal decoder could generate
a minimum mean square error estimate for each pixel prediction. A vector of
linear weights and an associated set of displacements are defined to determine
the prediction for each pixel. It is reported that the method effectively
eliminates blocking artifacts and reduces prediction error.

[67] proposes an estimation-theoretic paradigm for analyzing and optimiz- 
ing the performance of block-based motion compensation algorithms. OBMC 
is derived as a linear estimator of each pixel intensity, given that the only 
motion information available to the decoder is a set of block-based vectors.
OBMC predicts the frame of a sequence by repositioning overlapping blocks 
of pixels from the reference frame, each weighted by some smooth window. 
The estimation-theoretic formulation leads directly to statistical techniques for 
optimized window design. The work also considers the problem of optimized 
motion estimation. Viewed within the estimation-theoretic paradigm for block- 
based motion compensation, the objective of motion estimation is to provide 
the decoder information that optimizes the performance of its prediction. Op- 
timal motion estimation for OBMC involves estimating a noncausally related 
motion field, and an iterative procedure is proposed for solving this problem. 
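
To illustrate the repositioning of overlapping, window-weighted blocks, the sketch below predicts a frame from one integer motion vector per block. The separable triangular window, the border clipping, and the function name obmc_predict are assumptions chosen for simplicity; the sketch reproduces neither the optimized windows of [67] nor the weights specified in H.263.

```python
import numpy as np

def obmc_predict(ref, mvs, block=8):
    """Overlapped block motion compensation with a triangular window.

    mvs[i, j] holds the integer (dy, dx) of block (i, j). Each block
    contributes a displaced patch of 2*block x 2*block pels centred on
    the block. Weighted patches are accumulated and then normalized.
    Assumes an even block size and frame dimensions that are multiples
    of the block size.
    """
    h, w = ref.shape
    half = block // 2
    # 1-D window whose copies shifted by `block` pels sum to one.
    ramp = (np.arange(2 * block) + 0.5) / block
    win1d = np.minimum(ramp, 2.0 - ramp)
    win = np.outer(win1d, win1d)
    pred = np.zeros((h, w))
    wsum = np.zeros((h, w))
    for i in range(h // block):
        for j in range(w // block):
            dy, dx = mvs[i, j]
            ys = np.arange(i * block - half, (i + 1) * block + half)
            xs = np.arange(j * block - half, (j + 1) * block + half)
            in_y = (ys >= 0) & (ys < h)
            in_x = (xs >= 0) & (xs < w)
            ys_in, xs_in = ys[in_y], xs[in_x]
            wpatch = win[np.ix_(in_y, in_x)]
            # Displaced reference samples, clipped at the frame border.
            ry = np.clip(ys_in + dy, 0, h - 1)
            rx = np.clip(xs_in + dx, 0, w - 1)
            pred[np.ix_(ys_in, xs_in)] += wpatch * ref[np.ix_(ry, rx)]
            wsum[np.ix_(ys_in, xs_in)] += wpatch
    # Normalization only matters at the frame border, where fewer
    # windows overlap.
    return pred / wsum

rng = np.random.default_rng(5)
ref = rng.normal(size=(16, 16))
zero_mvs = np.zeros((2, 2, 2), dtype=int)
uniform_mvs = np.tile(np.array([1, 0]), (2, 2, 1))
pred_zero = obmc_predict(ref, zero_mvs)
pred_uniform = obmc_predict(ref, uniform_mvs)
```

With identical vectors in all blocks the overlapped prediction reduces to ordinary displaced prediction. The window only has an effect where neighboring vectors differ, and there it blends the competing predictions smoothly across the block boundary.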

Tao and Orchard continue the investigation. With the goal of removing
motion uncertainty and quantization noise [71], they discuss OBMC and loop
filtering in [72], propose a method for window design [73], and investigate
non-iterative motion estimation [74]. They also suggest a gradient-based model
for the residual variance [75] and propose a parametric solution for OBMC [76].

2.2.3 Variable Block Size Motion Compensation 

Motion-compensated prediction with blocks of variable size improves the
efficiency of video compression algorithms by spatially adapting the
displacement information [2, 77-82]. Variable Block Size (VBS) prediction
assigns more than one motion vector per macroblock but uses just one motion
vector for a particular pixel.

[2] describes a method for optimizing the performance of block-based 
motion-compensated prediction for video compression using fixed or variable 
size blocks. A Lagrangian cost function is used to choose the motion vectors and
block sizes for each region of the prediction image that give the best
performance in a rate-distortion sense. For that, a quadtree is used to structure blocks
of variable size. The variable block size algorithm determines the optimal tree 
structure and yields a significant improvement in rate-distortion performance 
over motion compensation with a fixed block size. 
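
A quadtree decision of this kind can be sketched as a recursive comparison of Lagrangian costs. The fixed bit counts MV_BITS and FLAG_BITS, the integer-pel search, and the function names are hypothetical simplifications introduced here; [2] uses actual code lengths for the rate term.

```python
import numpy as np

MV_BITS = 10    # assumed fixed cost of one motion vector (hypothetical)
FLAG_BITS = 1   # assumed cost of one split/no-split flag (hypothetical)

def best_sse(cur, ref, top, left, size, search=2):
    """Smallest sum of squared errors over integer displacements."""
    block = cur[top:top + size, left:left + size].astype(np.float64)
    best = np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > ref.shape[0] or x + size > ref.shape[1]:
                continue
            cand = ref[y:y + size, x:x + size].astype(np.float64)
            best = min(best, float(np.sum((block - cand) ** 2)))
    return best

def quadtree(cur, ref, top, left, size, lam, min_size=4):
    """Return (cost, leaves) for the cheaper of 'keep' and 'split'.

    The cost is J = D + lam * R, with D the squared error and R the
    assumed number of bits for vectors and split flags.
    """
    flag = FLAG_BITS if size > min_size else 0  # no flag at the smallest size
    keep_cost = best_sse(cur, ref, top, left, size) + lam * (MV_BITS + flag)
    keep = (keep_cost, [(top, left, size)])
    if size <= min_size:
        return keep
    half = size // 2
    split_cost, leaves = lam * FLAG_BITS, []
    for oy in (0, half):
        for ox in (0, half):
            cost, sub = quadtree(cur, ref, top + oy, left + ox, half, lam, min_size)
            split_cost += cost
            leaves += sub
    return (split_cost, leaves) if split_cost < keep_cost else keep

# Toy data (assumed): the right half of the frame moves by one pel, the
# left half is static, so a 16 x 16 block straddling both needs two vectors.
rng = np.random.default_rng(4)
ref = rng.normal(size=(32, 32))
cur = ref.copy()
cur[:, 16:] = np.roll(ref, shift=1, axis=0)[:, 16:]
cost_free, leaves_free = quadtree(cur, ref, 8, 8, 16, lam=0.0)
cost_tight, leaves_tight = quadtree(cur, ref, 8, 8, 16, lam=1e6)
```

With a multiplier of zero, the tree splits the block into four 8 x 8 leaves that are each predicted without error. With a very large multiplier, the rate term dominates and the single 16 x 16 block is kept.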

[81, 82] investigate a more general tree structure for motion- and intensity- 
compensated video coding. In contrast to variable block size motion com- 
pensation, this approach incorporates also the intensity residual into the tree 
structure. The work discusses pruning and growing algorithms to determine 
rate-distortion optimal tree structures by utilizing a Lagrangian cost function. 

ITU-T Rec. H.263 [14] provides variable block size capability. The
INTER4V coding mode allows 8 x 8 blocks in addition to the standard 16 x 16
blocks. VBS motion-compensated prediction utilizes either OBMC or an
in-loop deblocking filter to reduce blocking artifacts. The emerging ITU-T Rec.
H.264 [22] allows up to 7 different block sizes from 16 x 16 to 4 x 4. Here,
the blocking artifacts are reduced by an in-loop deblocking filter.

2.2.4 Multiframe Motion Compensation 

Multiframe techniques were first utilized for background prediction by
Mukawa and Kuroda [83]. The method permits a prediction signal for
uncovered background. With a special background memory, the method is also
investigated by Hepper [84]. Lavagetto and Leonardi discuss block-adaptive
quantization of multiple-frame motion fields [85]. Gothe and Vaisey improve
motion compensation by using multiple temporal frames [3]. They provide
experimental results for 8 x 8 block-motion compensation with up to 8 previous
frames. Fukuhara, Asai, and Murakami propose low bit-rate video coding with
block partitioning and adaptive selection of two time-differential frame
memories [86]. The concept of multiframe motion-compensated prediction is used
by Budagavi and Gibson to control error propagation for video transmission
over wireless channels [4, 87-89].




Figure 2.3. Multiframe motion compensation. A reference frame is chosen by an additional 
picture reference parameter. 

Long-term memory motion-compensated prediction [5, 90-93] employs 
several reference frames, i.e., several previously decoded frames, whereas stan- 
dard motion-compensated prediction utilizes one reference frame, i.e. the pre- 
viously decoded frame. This is accomplished by assigning a variable picture 
reference parameter to each block motion vector as shown in Fig. 2.3. The 
additional reference parameter overcomes the restriction that a specific block 
has to be chosen from a certain reference frame. This generalization improves
compression efficiency of motion-compensated prediction. 
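
The role of the picture reference parameter can be illustrated by extending a block search over a buffer of reference frames. The sketch below is an assumption-laden toy: the name multiframe_search, the integer-pel full search, and the random test frames are ours, and no rate constraint is applied.

```python
import numpy as np

def multiframe_search(cur, refs, top, left, size=8, search=2):
    """Full search over every frame of a reference buffer.

    Returns (ref_idx, (dy, dx), sse). ref_idx plays the role of the
    picture reference parameter sent alongside the motion vector.
    """
    block = cur[top:top + size, left:left + size].astype(np.float64)
    best = (None, None, np.inf)
    for ref_idx, ref in enumerate(refs):
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = top + dy, left + dx
                if y < 0 or x < 0 or y + size > ref.shape[0] or x + size > ref.shape[1]:
                    continue
                cand = ref[y:y + size, x:x + size].astype(np.float64)
                sse = float(np.sum((block - cand) ** 2))
                if sse < best[2]:
                    best = (ref_idx, (dy, dx), sse)
    return best

# Toy buffer (assumed): three unrelated frames. The current frame repeats
# the content of the oldest one (index 2), displaced by one pel.
rng = np.random.default_rng(2)
refs = [rng.normal(size=(24, 24)) for _ in range(3)]
cur = np.roll(refs[2], shift=(0, 1), axis=(0, 1))
ref_idx, mv, sse = multiframe_search(cur, refs, top=8, left=8)
```

A search restricted to the most recent frame (index 0) could not find this match; the additional parameter lets the block be taken from index 2 instead.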

In [91], the long-term memory covers several seconds of decoded frames at 
the encoder and decoder. The use of multiple frames for motion compensation 
in most cases provides significantly improved prediction gain. The variable 
picture reference that permits the use of several frames has to be transmitted 
as side information requiring an additional bit-rate which may be prohibitive 
when the size of the long-term memory becomes large. Therefore, the bit-rate
of the motion information is controlled by employing rate-constrained motion
estimation to trade off the better prediction signal against the increased
bit-rate.

Multiframe block motion compensation in [89] makes use of the redundancy 
that exists across multiple frames in typical video conferencing sequences to 
achieve additional compression over that obtained by using single frame block 
motion compensation. The multiframe approach also has an inherent ability to 
overcome some transmission errors and is thus more robust when compared to 
the single frame approach. Additional robustness is achieved by randomized 
frame selection among the multiple previous frames. 

Annex U of ITU-T Rec. H.263 [14], entitled “Enhanced Reference Pic- 
ture Selection Mode,” provides multiframe capability for both improved com- 
pression efficiency and enhanced resilience to temporal error propagation due 
to transmission errors. The multiframe concept is also incorporated into the 
emerging ITU-T Rec. H.264 [22]. 

2.2.5 Superimposed Motion Compensation 

Standard block-based motion compensation approximates each block in the 
current frame by a spatially displaced block chosen from the previous frame. 
As an extension, long-term memory motion compensation chooses the blocks 
from several previously decoded frames [91]. The motion-compensated signal 
is determined by the transmitted motion vector and picture reference parame- 
ter. 

Now, consider N motion-compensated signals, also called hypotheses. The 
superimposed prediction signal is the linear superposition of these N hypothe- 
ses. Constant scalar coefficients determine the weight of each hypothesis for 
the predicted block. Only N scalar coefficients are used and each coefficient 
is applied to all pixel values of the corresponding hypothesis. That is, spatial 
filtering and OBMC are not employed. Note that weighted averaging of a set
of images is advantageous in the presence of noise as discussed by Unser and
Eden in [94].
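
The linear superposition with one scalar coefficient per hypothesis can be written in a few lines. In the sketch below, the function name superimpose, the default of equal weights 1/N, and the noise model (each hypothesis equals the target block plus independent unit-variance noise) are assumptions made to illustrate the averaging effect; they are not results from this book.

```python
import numpy as np

def superimpose(hypotheses, weights=None):
    """Linearly combine N equally sized blocks, one scalar weight each.

    If no weights are given, the N hypotheses are simply averaged.
    """
    hyps = np.stack([np.asarray(h, dtype=np.float64) for h in hypotheses])
    n = hyps.shape[0]
    if weights is None:
        w = np.full(n, 1.0 / n)
    else:
        w = np.asarray(weights, dtype=np.float64)
    # Each weight multiplies all pixel values of its hypothesis.
    return np.tensordot(w, hyps, axes=1)

# Assumed toy model: four hypotheses, each the target plus independent noise.
rng = np.random.default_rng(3)
target = rng.normal(size=(32, 32))
hyps = [target + rng.normal(size=target.shape) for _ in range(4)]
mse_single = float(np.mean((target - hyps[0]) ** 2))
mse_avg = float(np.mean((target - superimpose(hyps)) ** 2))
```

Under this independence assumption the averaged prediction has a clearly smaller mean square error than any single hypothesis, in line with the remark on weighted averaging in the presence of noise.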

Figure 2.4. Superimposed motion-compensated prediction with three hypotheses. Three
blocks of previously decoded frames are linearly combined to form a prediction signal for the
current frame.

Fig. 2.4 shows three hypotheses from previously decoded frames which are
linearly combined to form the superimposed prediction signal for the current 
frame. Please note that a hypothesis can be chosen from any reference frame. 
Therefore, each hypothesis has to be assigned an individual picture reference 
parameter [15, 16, 95, 96]. 

This scheme differs from the concept of B-frame prediction in three
significant ways: First, all reference frames are chosen from the past. No reference
is made to a subsequent frame, as with B-frames, and hence no extra delay is 
incurred. Second, hypotheses are not restricted to stem from particular refer- 
ence frames due to the picture reference parameter. This enables the encoder 
to find a much more accurate set of prediction signals, at the expense of a 
minor increase in the number of bits needed to specify them. Third, it is possi- 
ble to combine more than two motion-compensated signals. As will be shown 
later, these three properties of superimposed motion compensation improve the 
coding efficiency of an H.263 codec without incurring the delay that would be
caused by using B-pictures. 

A special forward prediction mode with two averaged prediction signals is 
specified in MPEG-2 [13, 97]. This mode is called “Dual Prime Prediction” 
and can be used for predictive compression of interlaced video. It is utilized 
in P-pictures where there are no B-pictures between the predicted and the ref- 
erence pictures. Dual prime is a forward field prediction where a single for- 
ward motion vector is estimated for each macroblock of the predicted frame 
picture. This motion vector points at the reference frame which is the most 
recent reconstructed frame. Using this vector, each field in the macroblock is 
associated with a field of the same parity in the reference frame. Two motion
vectors pointing at fields of opposite parity are derived from the estimated mo- 
tion vector by assuming a linear motion trajectory. That is, the motion vectors 
are scaled according to the temporal distance between the reference and the 
predicted frames. The two motion-compensated signals are simply averaged 
to form the prediction. 

The design of the superimposed motion-compensated predictor is such that 
the mean square prediction error is minimized while limiting the bit-rate con- 
sumed by the motion vectors and picture reference parameters. With variable 
length coding of the side information, the best choice of hypotheses will de- 
pend on the code tables used, while the best code tables depend on the prob- 
abilities of choosing certain motion vector/reference parameter combinations. 
Further, the best choice of hypotheses also depends on the linear coefficients 
used to weight each hypothesis, while the best coefficients depend on the co- 
variance matrix of the hypotheses. 

To solve this design problem, it is useful to interpret superimposed motion- 
compensated prediction as a vector quantization problem [98-100]. Entropy 
Constrained Vector Quantization (ECVQ) [101, 102], which is an extension 
of the Generalized Lloyd Algorithm [103], is employed to solve the design 
problem iteratively. For the interpretation of motion-compensated prediction, 
we argue that a block in the current frame is quantized. The output index of 
the quantizer is the index of the displacement vector. Each displacement vec- 
tor is represented by a unique entropy codeword. Further, the codebook used 
for quantization contains motion-compensated blocks chosen from previous 
frames. This codebook is adaptive as the reference frames change with the 
current frame. For superimposed prediction, the codebook contains N-tuples of
motion-compensated blocks whose components are linearly combined. 

Rate-constrained superimposed motion estimation utilizes a Lagrangian
cost function. The costs are calculated by adding the mean square prediction
error to a rate term for the motion information, which is weighted by a
Lagrange multiplier [104]. The estimator minimizes this cost function on a block
basis to determine multiple displacement parameters. This corresponds to the
biased nearest neighbor condition familiar from vector quantization with rate
constraint. The decoder linearly combines more than one motion-compensated
signal, each determined by its displacement parameters. In [15, 16],
several video sequences are encoded to show that the converged design
algorithm just averages multiple hypotheses.
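
One way to write this cost function is given below. The notation is introduced here for illustration only and is not taken from [15, 16]: s denotes the current block, c(d, r) the motion-compensated block selected by displacement d and picture reference parameter r, h the scalar weight of a hypothesis, R the codeword length in bits, and lambda the Lagrange multiplier. The squared error stands for the mean square prediction error up to a normalization by the block size, and the sum over rates assumes that the N references are coded independently.

```latex
J(d_1, r_1, \dots, d_N, r_N)
  = \Bigl\| \mathbf{s} - \sum_{\nu=1}^{N} h_\nu \, \mathbf{c}(d_\nu, r_\nu) \Bigr\|_2^2
  + \lambda \sum_{\nu=1}^{N} R(d_\nu, r_\nu),
  \qquad \lambda \ge 0
```

For lambda = 0 the estimator reduces to minimizing the prediction error alone. Larger values bias the choice toward hypotheses with short codewords, which is the biased nearest neighbor condition mentioned above.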

Superimposed motion compensation requires the estimation of multiple mo- 
tion vectors and picture reference parameters. Best prediction performance 
is obtained when the N motion vectors and picture reference parameters are 
jointly estimated. This joint estimation is computationally very demanding.
Complexity can be reduced by an iterative algorithm which improves
conditionally optimal solutions step by step [15, 16]. More details are discussed in
Section 2.3.2. 

It can be observed that increasing the number of hypotheses also improves 
the quality of the prediction signal. The gains by superimposed prediction 
with multiframe motion compensation are larger than with single frame motion 
compensation. That is, superimposed motion-compensated prediction benefits 
from multiframe motion compensation such that the PSNR prediction gain is 
more than additive [15, 16]. 

It is important to note that an N-hypothesis predictor uses N motion vectors
and picture reference parameters to form the prediction signal. Applying a
product code for these N references will increase the bit-rate for
N-hypothesis MCP by approximately a factor of N. This higher rate has to be
justified by the improved prediction quality.

When predicting a target block in the current frame, we have the choice of
several different predictors, i.e., the 1-hypothesis, 2-hypothesis, ...,
N-hypothesis predictor. Experiments reveal that none of these predictors on
its own is the best one in the rate-distortion sense. For the same prediction
quality, the 1-hypothesis predictor always provides the lowest bit-rate. On
the other hand, improved prediction quality can only be obtained by
increasing the number of hypotheses. Therefore, the optimal rate-distortion
performance results from selecting, for each block, the predictor with the
best rate-distortion trade-off. This selection depends on the block to be
predicted [15, 16].
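As a rough illustration of this per-block selection, the following Python
sketch picks the predictor with the smallest Lagrangian cost D + lambda * R.
The function name, the candidate tuples, and all numbers are hypothetical and
only show how the preferred number of hypotheses can change with the
multiplier; they are not measurements from [15, 16].

```python
def select_predictor(candidates, lam):
    """Return the candidate with the smallest Lagrangian cost D + lam * R.

    candidates: list of (num_hypotheses, distortion, rate_bits) tuples.
    """
    return min(candidates, key=lambda c: c[1] + lam * c[2])


# Hypothetical measurements for one block:
# (number of hypotheses, squared prediction error, motion rate in bits).
block = [(1, 900.0, 10), (2, 700.0, 20), (4, 640.0, 40)]

for lam in (1.0, 15.0, 40.0):
    n, d, r = select_predictor(block, lam)
    print(f"lambda={lam}: choose N={n} (D={d}, R={r})")
```

A small multiplier favors many hypotheses because distortion dominates the
cost, whereas a large multiplier favors the cheap 1-hypothesis predictor.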

2.3 Motion Estimation 

Classic motion estimation aims to minimize the energy in the displaced
frame difference. In fact, it is a bit allocation problem for both motion
vectors and displaced frame difference as they are dependent. Ramchandran,
Ortega and Vetterli investigate in [105] the problem of bit allocation for
dependent quantization. They apply it to the problem of frame type selection,
as do Lee and Dickinson in [106]. Ribas-Corbera and Neuhoff in [107] as well
as Schuster and Katsaggelos in [108] discuss in detail bit allocation between
displacement and displaced frame difference. Additional segmentation of the
displacement field is investigated in [109] and [110]. Woods et al. discuss
motion vector quantization for video coding [111, 112] as well as multiscale
modeling and estimation of motion fields [113]. Rate-constrained motion
estimation which utilizes a Lagrangian cost function is discussed by several
authors [114-118].
2.3.1 Rate-Constrained Motion Estimation

In the following discussion, we relate the problem of block-based
motion-compensated prediction to vector quantization with a rate constraint
[15]. For block-based motion-compensated prediction, each block in the
current frame is approximated by a spatially displaced block chosen from a
reference picture. Each $a \times a$ block is associated with a vector in the
$a^2$-dimensional space $\mathbb{R}^{a^2}$. An original block is represented
by the vector-valued random variable $\mathbf{s}$. A particular original
block is denoted by $s$. The quality of motion-compensated prediction is
measured by the average sum square error distortion between the original
blocks $\mathbf{s}$ and the predicted blocks $\hat{\mathbf{s}}$.

$$ D = E\left\{ \|\mathbf{s} - \hat{\mathbf{s}}\|_2^2 \right\} \tag{2.1} $$

The blocks are coded with a displacement code $\mathbf{b}$. Each displacement
codeword $b$ provides a unique rule for approximating the current block
sample $s$. The average rate of the displacement code is determined by its
average length.

$$ R = E\left\{ |\mathbf{b}| \right\} \tag{2.2} $$



Rate-distortion optimal motion-compensated prediction minimizes the average
prediction distortion subject to a given average displacement rate. This
constrained problem can be converted to an unconstrained problem by defining
a Lagrangian cost function $J$ with a Lagrange multiplier $\lambda$
[104, 101, 100, 119].

$$ J(\lambda) = E\left\{ \|\mathbf{s} - \hat{\mathbf{s}}\|_2^2 \right\} + \lambda\, E\left\{ |\mathbf{b}| \right\} \tag{2.3} $$

The predictor with the minimum Lagrangian costs is also a rate-distortion
optimal predictor.




Figure 2.5. Interpreting motion-compensated prediction as vector quantization. 

The ECVQ algorithm [101] suggests a vector quantizer model according
to Fig. 2.5. This model is interpreted for motion-compensated prediction as
follows: Given the original block $s$, the mapping $\alpha$ estimates the
best displacement index $i$ in the codebook of reference blocks (frame
memory). The mapping $\gamma$ assigns a variable length codeword to each
displacement index. To be lossless, $\gamma$ has to be invertible and
uniquely decodable [101]. For block-based motion-compensated prediction,
$\beta$ is a codebook look-up of reference blocks to determine the block
$\hat{s}$ that is used for prediction.

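Minimizing the Lagrangian costs (2.3) provides the rate-distortion optimum
predictor, that is, the optimum mappings $\alpha$ and $\gamma$. For that, the
Lagrangian cost function (2.3) is expressed in terms of the model mappings
$\alpha$ and $\gamma$,

$$ J(\alpha, \beta, \gamma, \lambda, \mathbf{s}) = E\left\{ \|\mathbf{s} - \beta \circ \alpha(\mathbf{s})\|_2^2 + \lambda\, |\gamma \circ \alpha(\mathbf{s})| \right\} \tag{2.4} $$

with the blocks $\hat{\mathbf{s}} = \beta \circ \alpha(\mathbf{s})$ used for
prediction and the codewords $\mathbf{b} = \gamma \circ \alpha(\mathbf{s})$.
For a given distribution of the original blocks $\mathbf{s}_c$ in the
training set and constant Lagrange multiplier $\lambda_c$, the optimal
predictor incorporates the optimal mappings $\alpha$ and $\gamma$ which
satisfy

$$ \min_{\alpha, \gamma} J(\alpha, \beta, \gamma, \lambda_c, \mathbf{s}_c). \tag{2.5} $$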



Given the distribution of the original blocks $\mathbf{s}_c$ in the training
set as well as the Lagrange multiplier $\lambda_c$, an iterative design
algorithm for solving (2.5) includes two steps. For motion-compensated
prediction, $\beta$ retrieves the compensated block from the frame memory,
which is simply a codebook look-up. The first step determines the optimal
displacement index $i = \alpha(s)$ for the given mapping $\gamma_c$.

$$ \min_{\alpha} E\left\{ \|\mathbf{s}_c - \beta_c \circ \alpha(\mathbf{s}_c)\|_2^2 + \lambda_c\, |\gamma_c \circ \alpha(\mathbf{s}_c)| \right\}
   \;\Longrightarrow\; \alpha(s) = \arg\min_{i} \left\{ \|s - \beta_c(i)\|_2^2 + \lambda_c\, |\gamma_c(i)| \right\} \tag{2.6} $$

(2.6) is the biased nearest neighbor condition familiar from vector
quantization with a rate constraint. The second step determines the optimal
entropy coding $\gamma$ for the given motion estimation $\alpha_c$. For a
given $\alpha_c$, the distribution of the displacement indices
$\mathbf{i}_c$ is constant.

$$ \min_{\gamma} E\left\{ \|\mathbf{s}_c - \beta_c \circ \alpha_c(\mathbf{s}_c)\|_2^2 + \lambda_c\, |\gamma \circ \alpha_c(\mathbf{s}_c)| \right\}
   \;\Longrightarrow\; \min_{\gamma} E\left\{ |\gamma(\mathbf{i}_c)| \right\} \tag{2.7} $$

(2.7) postulates the minimum average codeword length of the displacement
code, given the displacement indices. This problem can be solved with, e.g.,
the Huffman algorithm. Finally, given the entropy code $\gamma$, the problem
of rate-constrained motion estimation is solved by (2.6).
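The biased nearest neighbor search of (2.6) can be sketched for a single
block as follows. This is a minimal Python illustration under stated
assumptions: one reference frame, integer-pel displacements, and a
hypothetical codeword-length function `toy_code_len` that stands in for the
entropy code; none of these names come from the cited work.

```python
import numpy as np


def rate_constrained_search(cur, ref, top, left, size, search, lam, code_len):
    """Return the displacement (dy, dx) minimising SSE + lam * code_len(dy, dx)."""
    block = cur[top:top + size, left:left + size].astype(np.float64)
    best, best_cost = None, np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > ref.shape[0] or x + size > ref.shape[1]:
                continue  # candidate block lies outside the reference frame
            cand = ref[y:y + size, x:x + size].astype(np.float64)
            cost = np.sum((block - cand) ** 2) + lam * code_len(dy, dx)
            if cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best, best_cost


def toy_code_len(dy, dx):
    # Hypothetical codeword length in bits: longer codes for larger displacements.
    return 1 + 2 * (abs(dy) + abs(dx))


rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(32, 32))
cur = np.roll(ref, shift=(2, -1), axis=(0, 1))  # current frame = shifted reference

mv, cost = rate_constrained_search(cur, ref, 8, 8, 8, 4, lam=10.0, code_len=toy_code_len)
print("moderate lambda:", mv, cost)

# With a very large multiplier the rate term dominates and the cheapest
# codeword (zero displacement) wins regardless of the prediction error.
mv_big, _ = rate_constrained_search(cur, ref, 8, 8, 8, 4, lam=1e7, code_len=toy_code_len)
print("huge lambda:", mv_big)
```

The multiplier thus biases the nearest neighbor decision towards
displacements with short codewords.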
2.3.2 Rate-Constrained Estimation of Superimposed Motion

For rate-constrained estimation of superimposed block motion, the same
methodology is employed as outlined for rate-constrained motion estimation.
Each block in the current frame is approximated by more than one spatially
displaced block chosen from the set of reference pictures. Each $a \times a$
block is associated with a vector in the $a^2$-dimensional space
$\mathbb{R}^{a^2}$. An original block is represented by the vector-valued
random variable $\mathbf{s}$. A particular original block is denoted by $s$.

The vector quantizer model according to Fig. 2.5 is interpreted for
motion-compensated prediction with N superimposed blocks as follows: Given
the original block $s$, the mapping $\alpha$ again estimates the best
displacement index $i$. The mapping $\gamma$ assigns a variable length
codeword to each displacement index. $\gamma$ is invertible and uniquely
decodable. For motion-compensated prediction with superimposed blocks,
$\beta$ is a codebook look-up of a row-vector
$C = (c_1, c_2, \ldots, c_N)$ of N motion-compensated blocks. These N
motion-compensated blocks are superimposed and determine the block $\hat{s}$
that is used to predict $s$. The weighted superposition of N
motion-compensated blocks is accomplished with N scalar weights $h_\mu$,
$\mu = 1, 2, \ldots, N$, that sum to one. For simplicity, the vector of
scalar weights is denoted by $h$.

$$ h = \begin{pmatrix} h_1 \\ h_2 \\ \vdots \\ h_N \end{pmatrix} \tag{2.8} $$

For the scheme of superimposed motion, minimizing the Lagrangian costs
(2.3) provides the rate-distortion optimum predictor for superimposed blocks,
that is, the optimum mappings $\alpha$, $\beta$, and $\gamma$. The Lagrangian
cost function (2.3) is expressed in terms of the model mappings $\alpha$,
$\beta$, and $\gamma$ according to (2.4). The blocks
$\hat{\mathbf{s}} = \beta \circ \alpha(\mathbf{s})$ are used to predict
$\mathbf{s}$, and the codewords $\mathbf{b} = \gamma \circ \alpha(\mathbf{s})$
to encode the displacement indices. For a given distribution of the original
blocks $\mathbf{s}_c$ in the training set and constant Lagrange multiplier
$\lambda_c$, the optimal predictor incorporates the optimal mappings
$\alpha$, $\beta$, and $\gamma$ which satisfy

$$ \min_{\alpha, \beta, \gamma} J(\alpha, \beta, \gamma, \lambda_c, \mathbf{s}_c). \tag{2.9} $$

Given the distribution of the original blocks $\mathbf{s}_c$ in the training
set as well as the Lagrange multiplier $\lambda_c$, an iterative design
algorithm for solving (2.9) includes three steps. For superimposed
motion-compensated prediction, $\beta$ is also optimized. The first step
determines the optimal displacement indices $i = \alpha(s)$ for the given
mappings $\beta_c$ and $\gamma_c$.



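$$ \min_{\alpha} E\left\{ \|\mathbf{s}_c - \beta_c \circ \alpha(\mathbf{s}_c)\|_2^2 + \lambda_c\, |\gamma_c \circ \alpha(\mathbf{s}_c)| \right\}
   \;\Longrightarrow\; \alpha(s) = \arg\min_{i} \left\{ \|s - \beta_c(i)\|_2^2 + \lambda_c\, |\gamma_c(i)| \right\} \tag{2.10} $$

(2.10) is the biased nearest neighbor condition familiar from vector
quantization with a rate constraint. The second step determines the optimal
entropy coding $\gamma$ for the given estimation of superimposed motion
$\alpha_c$. For given $\alpha_c$, the distribution of the displacement
indices $\mathbf{i}_c$ is constant.

$$ \min_{\gamma} E\left\{ \|\mathbf{s}_c - \beta_c \circ \alpha_c(\mathbf{s}_c)\|_2^2 + \lambda_c\, |\gamma \circ \alpha_c(\mathbf{s}_c)| \right\}
   \;\Longrightarrow\; \min_{\gamma} E\left\{ |\gamma(\mathbf{i}_c)| \right\} \tag{2.11} $$

(2.11) postulates a minimum average codeword length of the displacement
code, given the displacement indices. The third step determines the optimal
superposition, i.e., the optimal scalar weights, given the mappings
$\alpha_c$ and $\gamma_c$.

$$ \min_{\beta} E\left\{ \|\mathbf{s}_c - \beta \circ \alpha_c(\mathbf{s}_c)\|_2^2 + \lambda_c\, |\gamma_c \circ \alpha_c(\mathbf{s}_c)| \right\}
   \;\Longrightarrow\; \min_{h:\, 1^T h = 1} E\left\{ \|\mathbf{s}_c - \mathbf{C}_c h\|_2^2 \right\} \tag{2.12} $$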



(2.12) is the Wiener problem for the conditionally optimal superposition
coefficients. The superimposed predictor preserves the expected value of the
original block, i.e., $E\{\mathbf{s}\} = E\{\hat{\mathbf{s}}\}$. Consequently,
the Wiener problem can be expressed in covariance notation

$$ \min_{h:\, 1^T h = 1} \left\{ C_{ss} - 2 h^T C_{cs} + h^T C_{cc} h \right\} \tag{2.13} $$

where $C_{ss}$ is the scalar variance of the original blocks, $C_{cc}$ the
$N \times N$ covariance matrix of the motion-compensated blocks, and $C_{cs}$
the $N \times 1$ covariance vector between the motion-compensated blocks and
the original blocks. (2.13) is a constrained Wiener problem as the scalar
weights $h_\mu$ sum to 1, i.e., $1^T h = 1$. A Lagrangian approach leads to
the conditionally optimal superposition coefficients

$$ h = C_{cc}^{-1} \left( C_{cs} - \frac{1^T C_{cc}^{-1} C_{cs} - 1}{1^T C_{cc}^{-1} 1}\, 1 \right). \tag{2.14} $$



Experiments with video sequences in [15, 16] reveal that the optimum
superposition coefficients are approximately $1/N$ for N superimposed
motion-compensated blocks.
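The constrained Wiener solution can be checked numerically. The sketch below
solves the problem "minimize h^T C_cc h - 2 h^T C_cs subject to 1^T h = 1"
with a Lagrange multiplier, which is the standard route to (2.14). The
covariance values are hypothetical and merely model N equally good,
equally correlated hypotheses; they are not statistics from [15, 16].

```python
import numpy as np


def constrained_wiener_weights(C_cc, C_cs):
    """Weights minimising h^T C_cc h - 2 h^T C_cs subject to 1^T h = 1."""
    ones = np.ones(len(C_cs))
    a = np.linalg.solve(C_cc, C_cs)      # C_cc^{-1} C_cs
    b = np.linalg.solve(C_cc, ones)      # C_cc^{-1} 1
    mu = (ones @ a - 1.0) / (ones @ b)   # multiplier enforcing the sum constraint
    return a - mu * b


# Hypothetical statistics: N hypotheses with identical variance,
# identical mutual covariance, and identical covariance with the original.
N = 4
C_cc = 0.2 * np.eye(N) + 0.8 * np.ones((N, N))
C_cs = 0.8 * np.ones(N)

h = constrained_wiener_weights(C_cc, C_cs)
print(h, h.sum())
```

For such symmetric statistics the weights come out equal, which is consistent
with the observation that the converged design simply averages the
hypotheses.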
Given the entropy code $\gamma$ and the weighted superposition $\beta$, the
problem of rate-constrained estimation of superimposed motion is solved by
(2.10). The complexity of superimposed motion estimation increases
exponentially with the number of superimposed blocks N. An iterative
algorithm, which is inspired by the Iterated Conditional Modes [120], avoids
searching the complete space by successively improving N conditionally
optimal solutions. Convergence to a local optimum is guaranteed, because the
algorithm prohibits an increase of the Lagrangian costs.



0: Assuming N superimposed motion-compensated blocks, the Lagrangian
   cost function

   $$ j(c_1, \ldots, c_\mu, \ldots, c_N) = \left\| s - \sum_{\nu=1}^{N} c_\nu h_\nu \right\|_2^2 + \lambda\, |\gamma(i[c_1, \ldots, c_N])| $$

   is subject to minimization for each original block $s$. Select the
   entropy code $\gamma$, predictor coefficients $h$, and the Lagrange
   multiplier $\lambda$. Initialize the algorithm with N motion-compensated
   blocks $(c_1^{(0)}, \ldots, c_N^{(0)})$ and set $k := 0$.

1: Select the $\mu$-th block out of N; start from the first and end with the
   N-th.

   a: Focus on the $\mu$-th block. All others are kept constant. Select
      a local neighborhood of the block $c_\mu^{(k)}$ as the conditional
      search space for block $c_\mu^{(k+1)}$.

   b: Minimize the Lagrangian cost function by full search within the
      conditional search space and determine the new block $c_\mu^{(k+1)}$.

      $$ \min_{c_\mu^{(k+1)}} j\left( c_1^{(k+1)}, \ldots, c_{\mu-1}^{(k+1)}, c_\mu^{(k+1)}, c_{\mu+1}^{(k)}, \ldots, c_N^{(k)} \right) $$

2: As long as the Lagrangian cost function decreases, continue with step
   1 and set $k := k + 1$.



Figure 2.6. Hypothesis Selection Algorithm for block-based superimposed motion estimation. 

The Hypothesis Selection Algorithm (HSA) in Fig. 2.6 provides a locally
optimal solution for (2.10). The HSA is initialized with N > 1
motion-compensated blocks by splitting the optimal motion-compensated block
for N = 1. The computational complexity of finding a solution for N = 1 is
rather moderate. This optimal motion-compensated block is repeated N times
to generate the initial vector of N motion-compensated blocks. For each of
the N motion-compensated blocks in each iteration, the HSA performs a full
search within a conditional search space. The size of this conditional search
space affects both the quality of the local optimum and the computational
complexity of the algorithm [15, 16].
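The steps of the HSA can be sketched in Python as follows. This is a
simplified illustration, not the implementation used in [15, 16]: it assumes
a single reference frame, integer-pel displacements, fixed weights 1/N, a
product code whose rate is the sum of the individual codeword lengths, and a
hypothetical length function `toy_code_len`. All function and parameter names
are made up for this example.

```python
import numpy as np


def fetch(ref, top, left, size, mv):
    """Motion-compensated block for displacement mv, or None if outside the frame."""
    y, x = top + mv[0], left + mv[1]
    if y < 0 or x < 0 or y + size > ref.shape[0] or x + size > ref.shape[1]:
        return None
    return ref[y:y + size, x:x + size].astype(np.float64)


def lagrangian_cost(block, ref, top, left, mvs, lam, code_len):
    """Cost of averaging the hypotheses addressed by the displacements mvs."""
    blocks = [fetch(ref, top, left, block.shape[0], mv) for mv in mvs]
    if any(b is None for b in blocks):
        return np.inf
    pred = np.mean(blocks, axis=0)            # superposition with weights 1/N
    rate = sum(code_len(mv) for mv in mvs)    # product code: rates add up
    return np.sum((block - pred) ** 2) + lam * rate


def window(center, radius):
    return [(center[0] + dy, center[1] + dx)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)]


def hsa(cur, ref, top, left, size, n_hyp, lam, code_len, init_range=4, cond_range=1):
    block = cur[top:top + size, left:left + size].astype(np.float64)
    # Step 0: best single hypothesis, repeated N times.
    best1 = min(window((0, 0), init_range),
                key=lambda mv: lagrangian_cost(block, ref, top, left, [mv], lam, code_len))
    mvs = [best1] * n_hyp
    best = lagrangian_cost(block, ref, top, left, mvs, lam, code_len)
    while True:
        previous = best
        for mu in range(n_hyp):               # Step 1: one hypothesis at a time
            for cand in window(mvs[mu], cond_range):
                trial = mvs[:mu] + [cand] + mvs[mu + 1:]
                c = lagrangian_cost(block, ref, top, left, trial, lam, code_len)
                if c < best:                  # never accept a cost increase
                    mvs, best = trial, c
        if best >= previous:                  # Step 2: stop without improvement
            return mvs, best


def toy_code_len(mv):
    # Hypothetical codeword length in bits for one displacement vector.
    return 1 + 2 * (abs(mv[0]) + abs(mv[1]))


rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(48, 48))
# Synthetic current frame: average of the reference and a copy shifted by one row.
cur = 0.5 * (ref + np.roll(ref, 1, axis=0))

mvs, best = hsa(cur, ref, 16, 16, 16, n_hyp=2, lam=1.0, code_len=toy_code_len)
print(sorted(mvs), best)
```

Because a candidate is accepted only if it lowers the cost, the cost sequence
is non-increasing and the loop terminates in a local optimum. The parameter
`cond_range` plays the role of the conditional search space size discussed
above.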



2.3.3 Quantizer Selection at the Residual Encoder 

Hybrid video codecs usually encode the motion-compensated prediction error
signal. For that, a uniform scalar quantizer with quantizer step-size $Q$ is
utilized. Rate-constrained motion estimation raises the question of how to
select the quantizer step-size depending on the Lagrange multiplier
$\lambda$. A solution to this problem is suggested in [121].

Given the Lagrangian cost function $J = D + \lambda R$, total rate $R$ and
distortion $D$ are in equilibrium for $dJ = dD + \lambda\, dR = 0$.
Consequently, $\lambda$ is the negative derivative of the distortion with
respect to the total rate.

$$ \lambda = -\frac{dD}{dR} \tag{2.15} $$



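The rate of the motion-compensated predictor $R_p$ and the rate of the
residual encoder $R_r$ sum up to the total rate $R$. Thus, we can write

$$ dD = \frac{\partial D}{\partial R_p}\, dR_p + \frac{\partial D}{\partial R_r}\, dR_r \tag{2.16} $$

and

$$ dR = dR_p + dR_r. \tag{2.17} $$

Inserting (2.16) and (2.17) into the equilibrium condition
$dJ = dD + \lambda\, dR = 0$ gives

$$ dJ = \left( \frac{\partial D}{\partial R_p} + \lambda \right) dR_p + \left( \frac{\partial D}{\partial R_r} + \lambda \right) dR_r = 0. \tag{2.18} $$

(2.18) is the condition for the equilibrium and has to hold for any $dR_p$
and $dR_r$. As a result, the partial derivatives of the distortion are
identical and equal to $-\lambda$.

$$ \frac{dD}{dR} = \frac{\partial D}{\partial R_p} = \frac{\partial D}{\partial R_r} \tag{2.19} $$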



[122] discusses the optimal rate allocation between motion vector rate and
prediction error rate and reports the identity of the partial derivatives.
Assuming that the total rate is constant and that an infinitesimal bit has to
be assigned, the optimum trade-off is achieved when the decrease in
distortion caused by this infinitesimal bit is equal for both
motion-compensated prediction and residual encoding.

A memoryless Gaussian signal is assumed for the reconstruction error with
the distortion-rate function

$$ D(R_p, R_r) = \sigma_e^2(R_p)\, 2^{-2 R_r} = \sigma_e^2(R_p)\, e^{-2 R_r \ln 2}, \tag{2.20} $$

where $\sigma_e^2(R_p)$ is the variance of the prediction error as a function
of the predictor rate $R_p$. The partial derivative with respect to $R_r$
provides $\lambda = 2 \ln 2 \cdot D(R_p, R_r)$. Note that the factor
$2 \ln 2$ is related to the slope of 6 dB/bit. At high rates, the rate of the
predictor $R_p$ is negligible compared to the rate of the residual encoder
$R_r$ and the distortion $D(R_p, R_r)$ is dominated by the quantizer of the
residual encoder. A high-rate approximation for the uniform scalar quantizer

$$ D(R_p, R_r) = \frac{Q^2}{12} \tag{2.21} $$

relates the quantizer step-size $Q$ to the distortion $D(R_p, R_r)$ and
$\lambda$.

$$ \lambda = \frac{\ln 2}{6}\, Q^2 \approx 0.1\, Q^2 \tag{2.22} $$

In practice, this theoretical relationship is modified to

$$ \lambda \approx 0.2\, Q^2. \tag{2.23} $$

In [121, 123], the functional dependency $\lambda = 0.85 \cdot QP^2$ is
suggested for the ITU-T Rec. H.263, where $QP$ is the H.263 quantization
parameter. The step-size of the H.263 uniform quantizer is determined by
multiplying the quantization parameter by 2, i.e., $Q = 2 \cdot QP$. This
result motivates the approximation in (2.23).
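The arithmetic behind these relationships is easy to reproduce. The following
Python lines, with hypothetical function names, evaluate the theoretical
multiplier $2 \ln 2 \cdot Q^2 / 12$ and the empirical H.263 rule
$0.85 \cdot QP^2$, and express the latter in terms of $Q = 2 \cdot QP$.

```python
import math


def lambda_high_rate(q):
    """Theoretical multiplier: lambda = 2 ln 2 * D with D = Q^2 / 12."""
    return 2.0 * math.log(2.0) * q * q / 12.0


def lambda_h263(qp):
    """Empirical H.263 rule quoted in the text: lambda = 0.85 * QP^2."""
    return 0.85 * qp * qp


for qp in (4, 10, 20):
    q = 2 * qp  # H.263 quantizer step-size
    ratio = lambda_h263(qp) / q ** 2
    print(f"QP={qp}: theory={lambda_high_rate(q):.2f}, "
          f"H.263 rule={lambda_h263(qp):.2f}, rule / Q^2 = {ratio:.4f}")
```

The empirical rule corresponds to $0.85 / 4 = 0.2125$ times $Q^2$, which is
roughly twice the theoretical factor $\ln 2 / 6$.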

2.3.4 Efficient Motion Estimation 

Multiframe motion-compensated prediction and, in particular, superim- 
posed motion-compensated prediction increase the computational complexity 
of motion estimation. Fast search algorithms can be used to reduce the com- 
putational burden without sacrificing performance. 

[124] presents a fast exhaustive search algorithm for motion estimation. The 
basic idea is to obtain the best estimate of the motion vectors by successively 
eliminating the search positions in the search window and thus decreasing the 
number of matching evaluations that require very intensive computations. 

An improved approach is published in [125]. This work proposes a fast
block-matching algorithm that uses fast matching error measures besides the
conventional mean absolute error or mean square error. An incoming block in
the current frame is compared to candidate blocks within the search window
using multiple matching criteria. The fast matching error measures are based
on integral projections, which are good block features and allow matching
errors to be measured with low complexity. Most of the candidate blocks can
be rejected by calculating only one or more of the fast matching error
measures. The time-consuming computations of mean square error or mean
absolute error are performed on only the few candidate blocks that first pass
all fast matching criteria.

One fast elimination criterion is based on the well-known triangle
inequality. A more interesting one is outlined in the following: Let
$e_i = s_i - \hat{s}_i$ be the prediction error for pixel
$i = 1, 2, \ldots, L$ and $L$ be the number of pixels in a block. $s_i$ are
the pixel values of the block in the current frame and $\hat{s}_i$ the pixel
values of the block used for prediction. For all error values $e_i$, the
inequality

$$ \sum_{i=1}^{L} \left( e_i - \frac{1}{L} \sum_{j=1}^{L} e_j \right)^2 \ge 0 \tag{2.24} $$

holds and can be simplified to

$$ \sum_{i=1}^{L} e_i^2 \ge \frac{1}{L} \left( \sum_{j=1}^{L} e_j \right)^2. \tag{2.25} $$

(2.25) states that the sum-square error is always larger than or equal to the
normalized squared difference of block pixel sums.

$$ \sum_{i=1}^{L} (s_i - \hat{s}_i)^2 \ge \frac{1}{L} \left( \sum_{j=1}^{L} s_j - \sum_{j=1}^{L} \hat{s}_j \right)^2 \tag{2.26} $$

These block pixel sums efficiently eliminate search positions with large
errors but do not require time-consuming computations. Moreover, the sum for
the incoming block in the current frame, $\sum_j s_j$, needs to be calculated
only once.
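A minimal Python sketch of this elimination rule is given below. It tests
each candidate against the lower bound on the sum-square error obtained from
the block pixel sums and evaluates the full error only if the bound does not
already exceed the best error found so far. The function name and the toy
data are hypothetical; in a real encoder the candidate sums would typically
be obtained from running sums rather than recomputed per candidate.

```python
import numpy as np


def search_with_sum_elimination(block, candidates):
    """Minimum-SSE search that skips candidates via the pixel-sum bound.

    Returns (index of best candidate, its SSE, number of SSE evaluations).
    """
    block = block.astype(np.float64)
    L = block.size
    block_sum = block.sum()                  # computed once for the current block
    best_idx, best_sse, evaluated = -1, np.inf, 0
    for idx, cand in enumerate(candidates):
        cand = cand.astype(np.float64)
        bound = (block_sum - cand.sum()) ** 2 / L   # lower bound on the SSE
        if bound >= best_sse:
            continue                         # cannot beat the current best
        evaluated += 1
        sse = np.sum((block - cand) ** 2)
        if sse < best_sse:
            best_idx, best_sse = idx, sse
    return best_idx, best_sse, evaluated


rng = np.random.default_rng(1)
block = rng.integers(0, 256, size=(8, 8))
# Toy candidates: the block with a brightness offset and a little noise.
candidates = [np.clip(block + off + rng.integers(-3, 4, size=(8, 8)), 0, 255)
              for off in (40, -25, 0, 60, 10)]

idx, sse, evaluated = search_with_sum_elimination(block, candidates)
print(idx, sse, evaluated, "of", len(candidates))
```

Since the bound never exceeds the true sum-square error, the result is the
same as that of an exhaustive search; only the number of full error
evaluations is reduced.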



2.4 Theory of Motion-Compensated Prediction

Can the future of a sequence be predicted based on its past? If so, how 
good could this prediction be? These questions are frequently encountered in 
many applications that utilize prediction schemes [126]. Video applications 
that use predictive coding are no exception and several researchers take the 
journey to explore motion-compensated prediction. For example, Cuvelier and 
Vandendorpe investigate the statistical properties of prediction error images 
[127], analyze coding of interlaced or progressive video [128], and explore
motion-compensated interpolation [129]. Li and Gonzales propose a locally
quadratic model of the motion estimation error criterion function and apply
it for subpixel interpolation [130]. Ribas-Corbera and Neuhoff use their
analytical model to optimize motion-vector accuracy in block-based video
coding [131]. Guleryuz and Orchard investigate rate-distortion based temporal
filtering for video compression [132] and provide a rate-distortion analysis
of DPCM compression of Gaussian autoregressive sequences [133]. Pang and
Tan examine the role of the optimum loop filter in hybrid coders [134]. Wedi
considers aliasing and investigates in [135] a time-recursive interpolation
filter for motion-compensated prediction. He extends his work to an adaptive
interpolation filter [136] and suggests a theoretical basis for adaptive
interpolation.

Girod presents in [137, 12, 138] an approach to characterize motion-com- 
pensated prediction. The approach relates the motion-compensated prediction 
error to the displacement error caused by inaccurate motion compensation. The 
model utilizes a Gaussian or uniform distributed displacement error to both 
capture the average accuracy of motion-compensation and evaluate the impact 
on the motion-compensated prediction error variance. A detailed discussion of
fractional-pel motion compensation and the efficiency of motion-compensated
prediction is given in [1].

In [139, 11], this theory has been extended to multihypothesis motion- 
compensated prediction to investigate multiple, linearly combined motion- 
compensated prediction signals. The paper introduces a statistical model for 
multiple motion-compensated signals, also called hypotheses. Each hypoth- 
esis utilizes the statistical model in [12]. In particular, it assumes statistical 
independence among the hypotheses. With this approach, the paper continues 
discussing the optimum Wiener filter for multihypothesis motion-compensated 
prediction. 

The discussion is based on a high-rate approximation of the rate-distortion 
performance of a hybrid video codec. The high-rate approximation assumes 
that the residual encoder encodes the prediction error with very small distor- 
tion. That is, any reference frame at time instance t that will be used for pre- 
diction suffers no degradation and is identical to the original frame at time 
instance t. With this assumption, the performance of a hybrid video codec can 
be investigated dependent on the efficiency of motion-compensated prediction. 

2.4.1 Frame Signal Model 

Let $\mathbf{v} = \{ v[l],\ l \in \Pi \}$ be a scalar random field over a
two-dimensional orthogonal grid $\Pi$ with horizontal and vertical spacing of
1. The vector $l = (x, y)^T$ denotes a particular location in the lattice
$\Pi$. We call this a space-discrete frame.

A frame is characterized by its autocorrelation function and power spectral
density. The scalar space-discrete cross correlation function [140] is
defined according to

$$ \phi_{ab}[l] = E\left\{ a[l_0 + l]\, b^*[l_0] \right\} \tag{2.27} $$

where $a$ and $b$ are complex-valued, jointly wide-sense stationary random
fields, $b^*$ is the complex conjugate of $b$, and $l_0 \in \Pi$ is an
arbitrary location. For wide-sense stationary random fields, the correlation
function does not depend on $l_0$ but only on the relative two-dimensional
shift $l$. The cross spectral density is defined according to

$$ \Phi_{ab}(\omega) = \mathcal{F}_*\left\{ \phi_{ab}[l] \right\} \tag{2.28} $$

where $\omega = (\omega_x, \omega_y)^T$ is the vector-valued frequency and
$\mathcal{F}_*\{\cdot\}$ the 2-D band-limited discrete-space Fourier
transform. In particular, the transform is

$$ \Phi_{ab}(\omega) = \sum_{l \in \Pi} \phi_{ab}[l]\, e^{-j \omega^T l} \quad \forall\, \omega \in\ ]-\pi, \pi] \times ]-\pi, \pi] \tag{2.29} $$

and its inverse is

$$ \phi_{ab}[l] = \frac{1}{4\pi^2} \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} \Phi_{ab}(\omega)\, e^{j \omega^T l}\, d\omega \quad \forall\, l \in \Pi. \tag{2.30} $$



It is assumed that an isotropic, exponentially decaying, space-continuous
autocorrelation function

$$ \phi_{vv}(x, y) = \sigma_v^2\, \rho_v^{\sqrt{x^2 + y^2}} \tag{2.31} $$

with spatial correlation coefficient $\rho_v$ and overall signal variance
$\sigma_v^2 = 1$ characterizes a space-continuous frame $v$. If we neglect
spectral replications due to sampling, this autocorrelation function
corresponds to the isotropic spatial power spectral density

$$ \Phi_{vv}(\omega_x, \omega_y) = \frac{2\pi}{\omega_0^2} \left( 1 + \frac{\omega_x^2 + \omega_y^2}{\omega_0^2} \right)^{-\frac{3}{2}} \tag{2.32} $$

with $\omega_0 = -\ln(\rho_v)$. Please refer to Appendix A.4 for details. A
spatial correlation coefficient of $\rho_v = 0.93$ is typical for video
signals. Band-limiting the space-continuous signals to the frequencies
$]-\pi, \pi] \times ]-\pi, \pi]$ and sampling them at the lattice locations
$\Pi$ provides the space-discrete signals.

2.4.2 Signal Model for Motion-Compensated Prediction

For motion-compensated prediction, the motion-compensated signal c is 
used to predict the current frame s. The model assumes that both the cur- 
rent frame s and the motion-compensated signal c originate from the model 
frame v. The model frame captures the statistical properties of a frame without 
motion and residual noise. 




Figure 2.7. Signal model for the current frame s and the motion-compensated signal c.

Fig. 2.7 depicts the signal model for motion-compensated prediction.
Adding statistically independent white Gaussian noise $n_0$ to the model
frame $v$ generates the current frame signal $s$. Shifting the model frame
$v$ by the statistically independent displacement error
$\Delta = (\Delta_x, \Delta_y)^T$ and adding statistically independent white
Gaussian noise $n_1$ provides the motion-compensated signal $c$. For the
shift, the ideal reconstruction of the band-limited signal $v[l]$ is shifted
by the continuous-valued displacement error $\Delta$ and re-sampled on the
original orthogonal grid. The noise signals $n_0$ and $n_1$ are also
statistically independent.

The model assumes that the true displacement is known and captures only
the displacement error. Obviously, motion-compensated prediction should
work best if we compensate the true displacement of the scene exactly for a
prediction signal. Less accurate compensation will degrade the performance.
To capture the limited accuracy of motion compensation, a vector-valued
displacement error $\Delta$ is associated with the motion-compensated signal
$c$. The displacement error reflects the inaccuracy of the displacement
vector used for motion compensation and transmission. The displacement vector
field can never be completely accurate since it has to be transmitted as side
information with a limited bit-rate. The model assumes that the 2-D
displacement error $\Delta$ is isotropic Gaussian with variance
$\sigma_\Delta^2$.

$$ p_\Delta(\Delta) = \frac{1}{2\pi\sigma_\Delta^2}\, e^{-\frac{\Delta^T \Delta}{2\sigma_\Delta^2}} \tag{2.33} $$

With this assumption, the accuracy of motion compensation is now captured by
the displacement error variance $\sigma_\Delta^2$. Further, it is assumed
that the displacement error is entirely due to rounding and is uniformly
distributed in the interval
$[-2^{\beta-1}, 2^{\beta-1}] \times [-2^{\beta-1}, 2^{\beta-1}]$, where
$\beta = 0$ for integer-pel accuracy, $\beta = -1$ for half-pel accuracy,
$\beta = -2$ for quarter-pel accuracy, etc. Given the displacement inaccuracy
$\beta$, the displacement error variance is

$$ \sigma_\Delta^2 = \frac{2^{2\beta}}{12}. \tag{2.34} $$



The current frame $s$ is linearly predicted from the motion-compensated
signal $c$. The prediction error is defined by

$$ e[l] = s[l] - f[l] * c[l] \tag{2.35} $$

where $f[l]$ is the impulse response of the 2-D prediction filter. The
asterisk * denotes 2-D convolution on the original orthogonal grid $\Pi$. The
filter is determined according to the minimum mean square error criterion.
The normalized power spectral density of the minimum prediction error is
determined by

$$ \frac{\Phi_{ee}(\omega)}{\Phi_{ss}(\omega)} = 1 - \frac{1}{1 + \alpha_0(\omega)}\, \frac{P(\omega) P^*(\omega)}{1 + \alpha_1(\omega)} \tag{2.36} $$

where $P(\omega)$ is the 2-D continuous-space Fourier transform
$\mathcal{F}\{p_\Delta(\Delta)\}$ of the 2-D displacement error PDF
$p_\Delta(\Delta)$. $\alpha_0(\omega)$ and $\alpha_1(\omega)$ are the
normalized power spectral densities of the residual noise in the current
frame $s$ and in the motion-compensated signal $c$, respectively.

$$ \alpha_\mu(\omega) = \frac{\Phi_{n_\mu n_\mu}(\omega)}{\Phi_{vv}(\omega)} \quad \text{for } \mu = 0, 1 \tag{2.37} $$

Please note that the power spectral density of the minimum prediction error
is normalized to that of the current frame $\Phi_{ss}(\omega)$, whereas the
power spectral density of the residual noise is normalized to that of the
model frame $\Phi_{vv}(\omega)$.



2.4.3 Signal Model for Multihypothesis Prediction 

For multihypothesis motion-compensated prediction, N motion-compensated
signals $c_\mu$ with $\mu = 1, 2, \ldots, N$ are used to predict the current
frame $s$. The model assumes that both the current frame $s$ and the N
motion-compensated signals $c_\mu$ originate from the model frame $v$. The
model frame captures the statistical properties of a frame without motion and
residual noise.

Figure 2.8. Signal model for the current frame s and N motion-compensated prediction signals
(hypotheses) $c_\mu$.

Fig. 2.8 depicts the signal model for multihypothesis motion-compensated
prediction. Adding statistically independent white Gaussian noise $n_0$ to
the model frame $v$ generates the current frame signal $s$. Shifting the
model frame $v$ by the statistically independent displacement error
$\Delta_\mu$ and adding statistically independent white Gaussian noise
$n_\mu$ provides the $\mu$-th motion-compensated signal $c_\mu$. For the
shift, the ideal reconstruction of the band-limited signal $v[l]$ is shifted
by the continuous-valued displacement error $\Delta_\mu$ and re-sampled on
the original orthogonal grid. The noise signals $n_\mu$ and $n_\nu$ with
$\mu, \nu = 0, 1, \ldots, N$ are mutually statistically independent for
$\mu \ne \nu$.

The model assumes that N true displacements exist and utilizes N
displacement errors to capture the limited accuracy of the N
motion-compensated signals. For that, a vector-valued displacement error
$\Delta_\mu$ with $\mu = 1, 2, \ldots, N$ is associated with the $\mu$-th
motion-compensated signal $c_\mu$. The $\mu$-th displacement error reflects
the inaccuracy of the $\mu$-th displacement vector used for multihypothesis
motion compensation. The model assumes that all displacement errors
$\Delta_\mu$ with $\mu = 1, 2, \ldots, N$ are isotropic Gaussian with
identical variance $\sigma_\Delta^2$ according to (2.33). The displacement
errors $\Delta_\mu$ and $\Delta_\nu$ with $\mu, \nu = 1, 2, \ldots, N$ are
mutually statistically independent for $\mu \ne \nu$.

The current frame $s$ is linearly predicted from the vector of
motion-compensated signals $c = (c_1, c_2, \ldots, c_N)^T$. The scalar
prediction error is defined by

$$ e[l] = s[l] - f[l] * c[l] \tag{2.38} $$

with the row-vector of impulse responses
$f[l] = (f_1[l], f_2[l], \ldots, f_N[l])$ of the 2-D prediction filter. The
asterisk * denotes 2-D convolution on the original orthogonal grid $\Pi$
according to

$$ f[l] * c[l] = \sum_{l_0 \in \Pi} f[l_0]\, c[l - l_0]. \tag{2.39} $$

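The filter is determined according to the minimum mean square error
criterion. The normalized power spectral density of the minimum prediction
error is determined by

$$ \frac{\Phi_{ee}}{\Phi_{ss}} = 1 - \frac{1}{1 + \alpha_0}\,
   (P_1^*, P_2^*, \ldots, P_N^*)
   \begin{pmatrix}
     1 + \alpha_1 & P_1 P_2^* & \cdots & P_1 P_N^* \\
     P_2 P_1^* & 1 + \alpha_2 & \cdots & P_2 P_N^* \\
     \vdots & \vdots & \ddots & \vdots \\
     P_N P_1^* & P_N P_2^* & \cdots & 1 + \alpha_N
   \end{pmatrix}^{-1}
   \begin{pmatrix} P_1 \\ P_2 \\ \vdots \\ P_N \end{pmatrix}. \tag{2.40} $$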
For simplicity, the argument $\omega$ is omitted. $P_\mu(\omega)$ is the 2-D
continuous-space Fourier transform $\mathcal{F}\{p_{\Delta_\mu}(\Delta)\}$ of
the $\mu$-th 2-D displacement error PDF $p_{\Delta_\mu}(\Delta)$.
$\alpha_0(\omega)$ and $\alpha_\mu(\omega)$ with $\mu = 1, 2, \ldots, N$ are
the normalized power spectral densities of the residual noise in the current
frame $s$ and in the motion-compensated signals $c_\mu$, respectively. The
normalized power spectral density of the residual noise is defined according
to (2.37). Please note that the power spectral density of the minimum
prediction error is normalized to that of the current frame
$\Phi_{ss}(\omega)$, whereas the power spectral density of the residual noise
is normalized to that of the model frame $\Phi_{vv}(\omega)$.



2.4.4 Performance Measures 

With high-rate assumptions, the motion-compensated prediction error e is 
sufficient for performance evaluation. As the spatial correlation of the predic- 
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tion error is only weak, the potential for redundancy reduction in the residual 
encoder is relatively small [11]. This suggests that the prediction error variance 

n n 

^ = ~ 2 j J (2.41) 

— 71 — 7T 

is a useful measure that is related to the minimum achievable transmission bit- 
rate for a given signal-to-noise ratio [141]. The minimization of the prediction 
error variance (2.41) is widely used to obtain the displacement vector and con- 
trol the coding mode in practical systems. A more refined measure is the rate 
difference [142] 

$$\Delta R = \frac{1}{8\pi^2} \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} \log_2 \left( \frac{\Phi_{ee}(\omega)}{\Phi_{ss}(\omega)} \right) d\omega \tag{2.42}$$

In (2.42), $\Phi_{ee}(\omega)$ and $\Phi_{ss}(\omega)$ are the power spectral densities
of the prediction error $e$ and the current frame $s$, respectively. Unlike (2.41),
the rate difference (2.42) takes the spatial correlation of the prediction error $e$
and the original signal $s$ into account. It represents the maximum bit-rate reduction
(in bits/sample) possible by optimum encoding of the prediction error $e$, compared
to optimum intra-frame encoding of the signal $s$, for Gaussian wide-sense stationary
signals for the same mean square reconstruction error [141]. A negative
$\Delta R$ corresponds to a reduced bit-rate compared to optimum intra-frame coding,
while a positive $\Delta R$ is a bit-rate increase due to motion compensation, as
it can occur for inaccurate motion compensation. The maximum bit-rate reduction
can be fully realized at high bit-rates, while for low bit-rates the actual
gain is smaller. The rate required for transmitting the displacement estimate is
neglected. The optimum balance between rates for the prediction error signal
and displacement vectors strongly depends on the total bit-rate. For high rates,
it is justified to neglect the rate for the displacement vectors [11].
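
The two performance measures can be evaluated numerically. The following is a minimal sketch, assuming a simple midpoint rule on the square $[-\pi, \pi]^2$; the function names and the grid size are illustrative choices and are not part of the original text.

```python
import math

def integrate_2d(f, n=64):
    """Midpoint-rule integral of f(wx, wy) over [-pi, pi] x [-pi, pi]."""
    h = 2.0 * math.pi / n
    total = 0.0
    for i in range(n):
        wx = -math.pi + (i + 0.5) * h
        for k in range(n):
            wy = -math.pi + (k + 0.5) * h
            total += f(wx, wy)
    return total * h * h

def prediction_error_variance(phi_ee, n=64):
    """Numerical version of (2.41); phi_ee(wx, wy) is the error spectrum."""
    return integrate_2d(phi_ee, n) / (4.0 * math.pi ** 2)

def rate_difference(ratio, n=64):
    """Numerical version of (2.42); ratio(wx, wy) = Phi_ee / Phi_ss."""
    log_ratio = lambda wx, wy: math.log2(ratio(wx, wy))
    return integrate_2d(log_ratio, n) / (8.0 * math.pi ** 2)
```

As a sanity check, a frequency-independent ratio $\Phi_{ee}/\Phi_{ss} = 1/4$ gives $\Delta R = \frac{1}{2}\log_2(1/4) = -1$ bit per sample.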

2.4.5 Conclusions 

Based on the simplifying assumptions, several important conclusions can be 
drawn from the theory. Doubling the accuracy of motion compensation, such 
as going from integer-pel to 1/2 -pel accuracy, can reduce the bit-rate by up to 
1 bit per sample independent of N for the noise-free case. An optimum
combination of N hypotheses always lowers the bit-rate for increasing N. If each
hypothesis is equally good in terms of displacement error PDF, doubling N
can yield a gain of 0.5 bits per sample if there is no residual noise. If realistic
residual noise levels are taken into account, the gain possible by doubling the
number of hypotheses N decreases with increasing N. Diminishing returns
and, ultimately, saturation are observed. If the residual noise power increases,
doubling and ultimately quadrupling the number of hypotheses N becomes 
more efficient than doubling the accuracy of motion compensation. The criti- 
cal accuracy beyond which the gain due to more accurate motion compensation 
is small moves to larger displacement error variances with increasing noise and 
increasing number of hypotheses N. Hence, sub-pel accurate motion compen- 
sation becomes less important with multihypothesis motion-compensated pre- 
diction. Spatial filtering of the motion-compensated candidate signals becomes 
less important if more hypotheses are combined [11]. 

2.5 Three-Dimensional Subband Coding of Video 

Hybrid video coding schemes utilize predictive coding with motion-com- 
pensated prediction for efficient compression. Such compression schemes re- 
quire sequential processing of video signals which makes it difficult to achieve 
efficient embedded representations. A multiresolution signal decomposition 
of the video signal seems to be promising to achieve efficient embedded rep- 
resentations. Mallat [148] discusses the wavelet representation as a suitable 
tool for multiresolution signal decomposition. Moreover, wavelets are also 
suitable for coding applications [149, 150]. For example, Shapiro proposes 
zerotrees of wavelet coefficients for embedded image coding [151]. Usevitch 
derives optimal bit allocations for biorthogonal wavelet coders which result in 
minimum reconstruction error [152]. Taubman proposes a new image compres- 
sion algorithm based on independent Embedded Block Coding with Optimized 
Truncation of the embedded bit-stream (EBCOT) [153]. 

Practical wavelet coding schemes are characterized by the construction of 
the wavelets. Daubechies and Sweldens factor wavelet transforms into lifting 
steps [154-156]. This construction scheme permits wavelet transforms that 
map integers to integers [157], a desirable property for any practical coding 
scheme. Moreover, the lifting scheme also permits adaptive wavelet transforms 
[158, 159]. Adaptive schemes like motion compensation can be incorporated 
into the lifting scheme. 
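
The integer-to-integer property of lifting can be illustrated with the Haar wavelet. The sketch below is one common integer Haar variant (often called the S-transform) and is chosen here only as an illustration; it is not taken from the cited references. It uses a predict step followed by an update step with floor rounding, and the inverse undoes the steps in reverse order.

```python
def haar_lift_forward(x):
    """Integer Haar transform of an even-length list via two lifting steps."""
    even, odd = x[0::2], x[1::2]
    # Predict step: the odd sample is predicted by its even neighbour.
    high = [o - e for e, o in zip(even, odd)]
    # Update step: add floor(high / 2); '>>' floors for negative ints too.
    low = [e + (h >> 1) for e, h in zip(even, high)]
    return low, high

def haar_lift_inverse(low, high):
    """Exact inverse: undo the update step, then the predict step."""
    even = [l - (h >> 1) for l, h in zip(low, high)]
    odd = [h + e for e, h in zip(even, high)]
    x = [0] * (2 * len(low))
    x[0::2] = even
    x[1::2] = odd
    return x
```

Because each lifting step only adds a function of the other channel, any rounding inside that function is cancelled exactly by the inverse, which is why integer inputs are reconstructed without loss.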

2.5.1 Motion Compensation and Subband Coding 

For hybrid video coding, there are attempts to keep the efficient predictive 
architecture with motion compensation and to use wavelet-based techniques 
for coding the displaced frame difference, e.g. [160]. These are spatial subband
coding techniques and do not provide three-dimensional subbands. Early
attempts at three-dimensional subband coding did not employ motion compensation.
In [18], Knauer applied the Hadamard transform for real-time television
compression by considering the sequence of frames as a volume. Further
research is published by Vetterli et al. [161], Biemond et al. [162], and
Jayant et al. [163]. Zhang et al. investigate memory-constrained
3D wavelet transforms without boundary effects and use the lifting structure
[164, 165]. Barlaud et al. suggest a 3D scan-based wavelet transform with
reduced memory requirements [166]. 3D wavelet coding with arbitrary regions
of support is discussed by Minami et al. [167]. 

Three-dimensional coding with motion compensation has first been sug- 
gested by Kronander [19, 168]. Akiyama et al. use global motion compensa- 
tion and three-dimensional transform coding [169]. Global motion compensa- 
tion with 3D wavelet coding is also investigated by Chou et al. in [170]. Zhang 
and Zafar present a motion-compensated wavelet transform coder for color 
video compression [171]. Ohm starts his investigation of 3D subband video 
coding [24, 172] with integer-pel accurate motion compensation and first order 
quadrature mirror filter in the temporal domain [173, 174]. Woods et al. dis- 
cuss a scheme for object-based spatio-temporal subband coding [175]. They 
optimize the trade-off of the rate between motion vectors and 3-D subbands 
[25] and consider digital cinema applications [176]. A bit allocation scheme 
for subband compression of HDTV is published in [177]. For video communi- 
cation over wireless channels, Chou and Chen propose a perceptually optimized 
3-D subband codec [178]. 

The multiresolution signal decomposition of the video signal permits temporal,
spatial, and rate-distortion scalability. This important feature is investigated
by many researchers. Uz et al. suggest a scheme for interpolative multiresolution
coding of digital HDTV [179, 180]. In addition to the multiresolution
representation of the motion-compensated three-dimensional signal, researchers
also investigate the multiresolution representation of motion, like
Ohm [181] and Zhang et al. [182]. Taubman and Zakhor propose a common
framework for rate and distortion based scaling of highly scalable compressed
video [183, 184, 23, 185, 186]. Pearlman et al. introduce in [187, 188] an
embedded wavelet video codec using three-dimensional set partitioning in
hierarchical trees [189]. Woods et al. discuss rate-constrained multiresolution
transmission of video [190], and present a resolution and frame-rate scalable
subband video coder [191]. Ranganath et al. outline a highly scalable
wavelet-based codec for very low bit-rate environments and introduce tri-zerotrees
[192]. Pesquet-Popescu et al. suggest a method for context modeling in the
spatio-temporal trees of wavelet coefficients [193] and propose the strategy of
fully scalable zerotree coding [194]. Zhang et al. investigate three-dimensional
embedded block coding with optimized truncation [195]. 

2.5.2 Motion-Compensated Lifted Wavelets 

The previously mentioned lifting scheme permits adaptive wavelet trans- 
forms. For the temporal subband decomposition, Pesquet-Popescu and Bot- 
treau improve the compression efficiency with lifting schemes that use mo- 
tion compensation [196]. Zhang et al. also incorporate motion compensation 
into the lifting scheme [197] and use this method for 3D wavelet compression 
of concentric mosaics [198]. Secker and Taubman investigate lifting schemes 
with block motion compensation [20] and deformable mesh motion compen- 
sation [199]. Ohm reviews the recent progress and discusses novel aspects 
with respect to irregular sampled signals, shift variance of wavelet transforms 
and non-dyadic wavelet processing [200]. Pateux et al. investigate several 
motion-compensated lifting implementations and compare to standardized hy- 
brid codecs [201]. Barlaud et al. extend their 3D scan-based wavelet transform 
codec and use the motion-compensated lifting scheme [202]. 
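
A temporal Haar lifting step with motion compensation can be sketched as follows. This is a toy illustration under strong assumptions: frames are 1-D lists, and "motion compensation" is a circular shift by an integer displacement `d`, so that the compensation is exactly invertible. The cited schemes use block-based or mesh-based compensation; the names `shift`, `mc_haar_forward`, and `mc_haar_inverse` are hypothetical.

```python
def shift(frame, d):
    """Circular shift by d samples: a toy stand-in for motion compensation."""
    n = len(frame)
    return [frame[(i - d) % n] for i in range(n)]

def mc_haar_forward(a, b, d):
    """Temporal Haar lifting of frames a and b along displacement d."""
    # Predict step: high band is b minus the motion-compensated frame a.
    high = [bi - pi for bi, pi in zip(b, shift(a, d))]
    # Update step: low band is a plus half of the inversely compensated high band.
    low = [ai + 0.5 * hi for ai, hi in zip(a, shift(high, -d))]
    return low, high

def mc_haar_inverse(low, high, d):
    """Undo the update step, then the predict step."""
    a = [li - 0.5 * hi for li, hi in zip(low, shift(high, -d))]
    b = [hi + pi for hi, pi in zip(high, shift(a, d))]
    return a, b
```

If frame `b` is an exactly displaced copy of frame `a`, the high band vanishes and the low band equals `a`, which is the case in which motion-compensated lifting is most efficient.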




Chapter 3 



MOTION-COMPENSATED PREDICTION 
WITH COMPLEMENTARY HYPOTHESES 



3.1 Introduction 

As discussed in Section 2.4, the theoretical investigation in [11] shows that 
a linear combination of multiple motion-compensated signals can improve the 
performance of motion-compensated prediction for video coding. This chapter 
extends that work by introducing the concept of complementary hypotheses 
[203, 204]. 

To motivate this concept, let us consider pairs of motion-compensated sig- 
nals. The two signals are simply averaged to form the prediction signal. We 
ask the question what kind of pairs are necessary to achieve the best predic- 
tion performance of superimposed motion compensation. If a pair consists 
of two identical hypotheses, the superimposed prediction signal is identical 
to either one of the hypotheses and we expect no improvement over motion- 
compensated prediction with just one hypothesis. But, in general, there will 
be pairs of hypotheses that outperform motion-compensated prediction with 
single hypotheses. Our approach is to model the dependency among the two 
signals by a correlation coefficient and investigate its impact on the perfor- 
mance of superimposed motion compensation. 

Our assumption that there will be $N$-tuples of hypotheses that outperform
motion-compensated prediction with single hypotheses is supported by experimental
results. As discussed in Section 2.3.2, the work in [15, 16] demonstrates
experimentally that such efficient $N$-tuples exist. The work proposes an
iterative algorithm for block-based rate-constrained superimposed motion
estimation. The algorithm improves conditionally optimal solutions and provides
a local optimum for the joint estimation problem. The results demonstrate
that joint estimation of hypotheses is important for superimposed
motion-compensated prediction. 

The outline of this chapter is as follows: Section 3.2 extends the known
model of multihypothesis motion-compensated prediction. Correlated displacement
errors for superimposed prediction are discussed and the concept
of motion compensation with complementary hypotheses is introduced. Further,
we discuss the gradient of the prediction error variance for superimposed
motion-compensated signals and the impact of a particular frame signal
model. Section 3.3 analyzes “noisy” hypotheses, investigates both averaging 
and Wiener filtering, and provides performance results for jointly estimated 
hypotheses. Section 3.4 explores hypothesis switching as a method to select ef- 
ficient hypotheses for superimposed motion-compensated prediction. First, the 
signal model for forward-adaptive hypothesis switching is introduced. Second, 
the problem is approached by minimizing the radial displacement error. Third, 
a property of the assumed PDF allows the definition of an equivalent predictor. 
And finally, forward-adaptive hypothesis switching is combined with superim- 
posed motion-compensated prediction. Section 3.5 discusses image sequence 
coding where individual images are predicted with a varying number of hy- 
potheses. The impact of superimposed motion estimation on the overall coding 
efficiency is investigated. 

3.2 Extended Model for Superimposed 
Motion-Compensated Prediction 

3.2.1 Superimposed Prediction and Correlated 
Displacement Error 

We extend the model for multihypothesis motion-compensated prediction
as discussed in Section 2.4 such that correlated displacement errors can be
investigated. Let $s[l]$ and $c_\mu[l]$ be scalar two-dimensional signals sampled on
an orthogonal grid with horizontal and vertical spacing of 1. The vector
$l = (x, y)^T$ denotes the location of the sample. For the problem of superimposed
motion compensation, we interpret $c_\mu$ as the $\mu$-th of $N$ motion-compensated
signals available for prediction, and $s$ as the current frame to be predicted. We
also call $c_\mu$ the $\mu$-th hypothesis.

Motion-compensated prediction should work best if we compensate the true
displacement of the scene exactly for a prediction signal. Less accurate
compensation will degrade the performance. To capture the limited accuracy of
motion compensation, we associate a vector-valued displacement error $\Delta_\mu$
with the $\mu$-th hypothesis $c_\mu$. The displacement error reflects the inaccuracy
of the displacement vector used for motion compensation and transmission.
The displacement vector field can never be completely accurate since it has
to be transmitted as side information with a limited bit-rate. For simplicity,
we assume that all hypotheses are shifted versions of the current frame signal
$s$. The shift is determined by the vector-valued displacement error $\Delta_\mu$ of the
$\mu$-th hypothesis. For that, the ideal reconstruction of the band-limited signal
$s[l]$ is shifted by the continuous-valued displacement error and re-sampled on
the original orthogonal grid. For now, the translatory displacement model as
depicted in Fig. 2.8 omits “noisy” signal components. 

A superimposed motion-compensated predictor forms a prediction signal by
averaging $N$ hypotheses $c_\mu[l]$ in order to predict the current frame signal $s[l]$.
The prediction error for each pel at location $l$ is the difference between the
current frame signal and $N$ averaged hypotheses

$$e[l] = s[l] - \frac{1}{N} \sum_{\mu=1}^{N} c_\mu[l]. \tag{3.1}$$

Assume that $s$ and $c_\mu$ are generated by a jointly wide-sense stationary random
process with the real-valued scalar two-dimensional power spectral density
$\Phi_{ss}(\omega)$ as well as the cross spectral densities $\Phi_{c_\mu s}(\omega)$ and
$\Phi_{c_\mu c_\nu}(\omega)$. Power spectra and cross spectra are defined according to
(2.28) where $\omega = (\omega_x, \omega_y)^T$ is the vector-valued frequency. 

The power spectral density of the prediction error in (3.1) is given by the
power spectrum of the current frame and the cross spectra of the hypotheses

$$\Phi_{ee}(\omega) = \Phi_{ss}(\omega) - \frac{2}{N} \sum_{\mu=1}^{N} \Re\{\Phi_{c_\mu s}(\omega)\} + \frac{1}{N^2} \sum_{\mu=1}^{N} \sum_{\nu=1}^{N} \Phi_{c_\mu c_\nu}(\omega), \tag{3.2}$$

where $\Re\{\cdot\}$ denotes the real component of the, in general, complex-valued cross
spectral densities $\Phi_{c_\mu s}(\omega)$. We adopt the expressions for the cross spectra
from [11], where the displacement errors $\Delta_\mu$ are interpreted as random variables
which are statistically independent from $s$:

$$\Phi_{c_\mu s}(\omega) = \Phi_{ss}(\omega) \, E\left\{ e^{-j \omega^T \Delta_\mu} \right\} \tag{3.3}$$

$$\Phi_{c_\mu c_\nu}(\omega) = \Phi_{ss}(\omega) \, E\left\{ e^{-j \omega^T (\Delta_\mu - \Delta_\nu)} \right\} \tag{3.4}$$

Like in [11], we assume a power spectrum $\Phi_{ss}$ that corresponds to an
exponentially decaying isotropic autocorrelation function with a correlation
coefficient $\rho_s$. 

For the $\mu$-th displacement error $\Delta_\mu$, a 2-D stationary normal distribution
with variance $\sigma_\Delta^2$ and zero mean is assumed where the x- and y-components
are statistically independent. The displacement error variance is the same for
all $N$ hypotheses. This is reasonable because all hypotheses are compensated
with the same accuracy. Further, the pairs $(\Delta_\mu, \Delta_\nu)$ are assumed to be
jointly Gaussian random variables. The predictor design in [16] showed that there
is no preference among the $N$ hypotheses. Consequently, the correlation
coefficient $\rho_\Delta$ between two displacement error components $\Delta_{x\mu}$ and
$\Delta_{x\nu}$ is the same for all pairs of hypotheses. The above assumptions are
summarized by the covariance matrix of a displacement error component.

$$C_{\Delta_x \Delta_x} = \sigma_\Delta^2 \begin{pmatrix} 1 & \rho_\Delta & \cdots & \rho_\Delta \\ \rho_\Delta & 1 & \cdots & \rho_\Delta \\ \vdots & \vdots & \ddots & \vdots \\ \rho_\Delta & \rho_\Delta & \cdots & 1 \end{pmatrix} \tag{3.5}$$

Since the covariance matrix is nonnegative definite [205], the correlation
coefficient $\rho_\Delta$ in (3.5) has the limited range

$$\frac{1}{1-N} \le \rho_\Delta \le 1 \quad \text{for } N = 2, 3, 4, \ldots \tag{3.6}$$

which is dependent on the number of hypotheses $N$. More details on this result
can be found in Appendix A.1. In contrast to [11], we do not assume that the
displacement errors $\Delta_\mu$ and $\Delta_\nu$ are mutually independent for $\mu \ne \nu$. 
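
The range (3.6) can be checked directly. A matrix with equal diagonal entries and equal off-diagonal entries $\rho_\Delta$ has the eigenvalue $1 + (N-1)\rho_\Delta$ (for the all-ones eigenvector) and the eigenvalue $1 - \rho_\Delta$ (with multiplicity $N-1$), so both must be nonnegative. The sketch below is illustrative only; the helper names are assumptions, and the variance is normalized to one.

```python
def covariance(n, rho, var=1.0):
    """N x N covariance matrix with equal variances and equal correlation rho."""
    return [[var * (1.0 if i == j else rho) for j in range(n)] for i in range(n)]

def quad_form(c, x):
    """x^T C x, which is nonnegative for every x if C is a valid covariance."""
    n = len(x)
    return sum(x[i] * c[i][j] * x[j] for i in range(n) for j in range(n))

def rho_lower_bound(n):
    """Lower bound of the correlation coefficient for N >= 2 hypotheses."""
    return 1.0 / (1 - n)

def is_valid_rho(n, rho):
    """Both eigenvalues, 1 + (N - 1) rho and 1 - rho, must be nonnegative."""
    return 1.0 + (n - 1) * rho >= 0.0 and 1.0 - rho >= 0.0
```

For $N = 2$ the lower bound is $-1$, and it approaches $0$ from below as $N$ grows: a large set of displacement errors cannot all be strongly negatively correlated with each other.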

These assumptions allow us to express the expected values in (3.3) and (3.4)
in terms of the 2-D Fourier transform $P$ of the continuous 2-D probability
density function of the displacement error $\Delta_\mu$.

$$E\left\{ e^{-j \omega^T \Delta_\mu} \right\} = \int_{\mathbb{R}^2} p_{\Delta_\mu}(\Delta) \, e^{-j \omega^T \Delta} \, d\Delta = P(\omega, \sigma_\Delta^2) = e^{-\frac{1}{2} \omega^T \omega \, \sigma_\Delta^2} \tag{3.7}$$

The expected value in (3.4) contains differences of jointly Gaussian random
variables. The difference of two jointly Gaussian random variables is also
Gaussian. As the two random variables have equal variance $\sigma_\Delta^2$, the variance
of the difference signal is given by $\sigma^2 = 2\sigma_\Delta^2(1 - \rho_\Delta)$.
Therefore, we obtain for the expected value in (3.4)

$$E\left\{ e^{-j \omega^T (\Delta_\mu - \Delta_\nu)} \right\} = P\left(\omega, 2\sigma_\Delta^2(1 - \rho_\Delta)\right) \quad \text{for } \mu \ne \nu. \tag{3.8}$$

For $\mu = \nu$, the expected value in (3.4) is equal to one. With that, we obtain for
the power spectrum of the prediction error in (3.2):

$$\frac{\Phi_{ee}(\omega)}{\Phi_{ss}(\omega)} = \frac{N+1}{N} - 2 P(\omega, \sigma_\Delta^2) + \frac{N-1}{N} P\left(\omega, 2\sigma_\Delta^2(1 - \rho_\Delta)\right) \tag{3.9}$$

Setting $\rho_\Delta = 0$ provides a result which is presented in [11], equation (23),
with negligible noise $\alpha_\mu(\omega) = 0$, averaging filter $F$, and identical
characteristic functions $P_\mu = P$. 
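
Equation (3.9) is straightforward to evaluate. The following sketch assumes the Gaussian characteristic function from (3.7) and takes the squared frequency magnitude $\omega^T\omega$ as a scalar argument `w2`; the function names are illustrative.

```python
import math

def char_fn(w2, var):
    """P(omega, var) from (3.7) for a Gaussian displacement error; w2 = omega^T omega."""
    return math.exp(-0.5 * w2 * var)

def normalized_error_spectrum(w2, n, var, rho):
    """Phi_ee / Phi_ss according to (3.9) for N = n averaged hypotheses."""
    return ((n + 1.0) / n
            - 2.0 * char_fn(w2, var)
            + (n - 1.0) / n * char_fn(w2, 2.0 * var * (1.0 - rho)))
```

Two limiting cases follow directly from the formula: for $\rho_\Delta = 1$ the result equals the single-hypothesis value $2 - 2P(\omega, \sigma_\Delta^2)$ for every $N$, and lowering $\rho_\Delta$ lowers the spectrum at every frequency.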



3.2.2 Complementary Hypotheses 

The previous section shows that the displacement error correlation coeffi- 
cient influences the performance of superimposed motion compensation. An 
ideal superimposed motion estimator will select sets of hypotheses that opti- 
mize the performance of superimposed motion compensation. In the following, 
we focus on the relationship between the prediction error variance 



$$\sigma_e^2 = \frac{1}{4\pi^2} \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} \Phi_{ee}(\omega) \, d\omega \tag{3.10}$$



and the displacement error correlation coefficient. The prediction error vari- 
ance is a useful measure because it is related to the minimum achievable trans- 
mission bit-rate [11]. 




Figure 3.1. Normalized prediction error variance for superimposed MCP over the
displacement error correlation coefficient $\rho_\Delta$. Reference is the single
hypothesis predictor. The hypotheses are averaged and no residual noise is assumed.
The variance of the displacement error is set to $\sigma_\Delta^2 = 1/12$. 

Fig. 3.1 depicts the dependency of the normalized prediction error variance
on the displacement error correlation coefficient within the range (3.6). The
dependency is plotted for $N = 2, 4, 8$, and $\infty$ for integer-pel accurate motion
compensation ($\sigma_\Delta^2 = 1/12$). The correlation coefficient of the frame signal
is $\rho_s = 0.93$ [11]. Reference is the prediction error variance of the single
hypothesis predictor $\sigma_{e,1}^2$. We observe that a decreasing correlation
coefficient lowers the prediction error variance. Equation (3.9) implies that this
observation holds for any displacement error variance. Fig. 3.1 also shows that
identical displacement errors ($\rho_\Delta = 1$) do not reduce the prediction error
variance compared to single hypothesis motion compensation. This is reasonable when
we consider identical hypotheses. They do not improve superimposed motion
compensation because they have identical displacement errors. 

To determine the performance bound, we assume the existence of all
$N$-tuples of hypotheses that obey (3.6) and that an ideal superimposed motion
estimator is able to determine any desired $N$-tuple. Assuming a mean square
error measure, the optimal ideal superimposed motion estimator minimizes the
summed squared error

$$\min \frac{1}{|\mathcal{L}|} \sum_{l \in \mathcal{L}} e^2[l] \tag{3.11}$$

and the expected value

$$\min E\left\{ \frac{1}{|\mathcal{L}|} \sum_{l \in \mathcal{L}} e^2[l] \right\} \tag{3.12}$$

where $e[l]$ denotes the prediction error at pixel location $l$. In addition, we
assume a stationary error signal such that

$$E\left\{ e^2[l] \right\} = \sigma_e^2[l] = \sigma_e^2 \quad \forall \, l. \tag{3.13}$$

Consequently, this optimal ideal estimator minimizes the prediction error variance.

$$\min \sigma_e^2 \tag{3.14}$$

Further, $\sigma_e^2$ increases monotonically for increasing $\rho_\Delta$. This is a
property of (3.9) which is also depicted in Fig. 3.1. The minimum of the prediction
error variance is achieved at the lower bound of $\rho_\Delta$.

$$\min \rho_\Delta \tag{3.15}$$

That is, an optimal ideal superimposed motion estimator minimizes the prediction
error variance by minimizing the displacement error correlation coefficient.
Its minimum is given by the lower bound of the range (3.6).

$$\rho_\Delta = \frac{1}{1-N} \quad \text{for } N = 2, 3, 4, \ldots \tag{3.16}$$
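
The monotonic dependency of the prediction error variance on the correlation coefficient can be reproduced numerically by integrating (3.9) according to (3.10). The sketch below makes two assumptions that are not stated in this section: the frame spectrum is taken as the isotropic model $\Phi_{ss}(\omega) \propto (1 + \omega^T\omega/\omega_0^2)^{-3/2}$ with $\omega_0 = -\ln \rho_s$, which corresponds to an exponentially decaying isotropic autocorrelation function, and the integral is approximated with a midpoint rule. All function names are illustrative.

```python
import math

def phi_ss(w2, rho_s=0.93):
    """Assumed isotropic model spectrum; w2 = omega^T omega."""
    w0 = -math.log(rho_s)
    return 2.0 * math.pi / w0 ** 2 * (1.0 + w2 / w0 ** 2) ** -1.5

def ratio(w2, n, var, rho):
    """(3.9), rewritten with expm1 so that small values keep their precision."""
    x = 0.5 * w2 * var
    return -2.0 * math.expm1(-x) + (n - 1.0) / n * math.expm1(-2.0 * x * (1.0 - rho))

def error_variance(n, var, rho, grid=64):
    """Midpoint-rule approximation of (3.10) on [-pi, pi] x [-pi, pi]."""
    h = 2.0 * math.pi / grid
    total = 0.0
    for i in range(grid):
        wx = -math.pi + (i + 0.5) * h
        for k in range(grid):
            wy = -math.pi + (k + 0.5) * h
            w2 = wx * wx + wy * wy
            total += phi_ss(w2) * ratio(w2, n, var, rho)
    return total * h * h / (4.0 * math.pi ** 2)

def normalized_variance(n, var, rho):
    """Variance relative to the single hypothesis predictor."""
    return error_variance(n, var, rho) / error_variance(1, var, 0.0)
```

For example, evaluating `normalized_variance(2, 1/12, rho)` for `rho` between $-1$ and $1$ yields values that grow with `rho` and reach one at `rho = 1`, in line with the qualitative behavior described for Fig. 3.1.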







This insight implies an interesting result for the case N = 2: Pairs of hy- 
potheses that operate on the performance bound show the property that their 
displacement errors are maximally negatively correlated. Hence, the combina- 
tion of two complementary hypotheses is more efficient than two independent 
hypotheses. 






Figure 3.2. Interpolation and displacement error. Due to an inaccurate displacement,
only the signal value at spatial location $\Delta_1$ is available (left). Averaging two
hypotheses with identical displacement errors does not improve the approximation
(middle). When we pick the signal value at spatial location $\Delta_2 = -\Delta_1$ and
average the two signal values, we will get closer to the signal value at spatial
location $l_x = 0$ (right). 

Let us consider the one-dimensional example in Fig. 3.2 where the intensity
signal $s$ is a continuous function of the spatial location $l_x$. A signal value
that we want to use for prediction is given at spatial location $l_x = 0$. Due to
an inaccurate displacement, only the signal value at spatial location $l_x = \Delta_1$
is available. We assume that the intensity signal is smooth around $l_x = 0$
and not spatially constant. When we pick the signal value at spatial location
$l_x = \Delta_2 = -\Delta_1$ and average the two signal values, we will get closer to the
signal value at spatial location $l_x = 0$. If we consider many displacement error
values $\Delta_1$ with distribution $p_\Delta$, we get for the random variables
$\Delta_1 = -\Delta_2$. This results in $\rho_\Delta = -1$. 
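
The one-dimensional argument can be tried out numerically. The sketch below uses a hypothetical smooth intensity signal (a shifted sine, chosen only for illustration) and compares three predictions of the value at $l_x = 0$: a single displaced sample, the average of two samples with identical displacement errors, and the average of two samples with complementary displacement errors.

```python
import math

def s(lx):
    """Hypothetical smooth, non-constant intensity signal (illustration only)."""
    return math.sin(lx + 0.7)

def prediction_errors(delta):
    """Absolute errors when predicting s(0) from displaced samples."""
    target = s(0.0)
    single = abs(s(delta) - target)
    identical = abs(0.5 * (s(delta) + s(delta)) - target)
    complementary = abs(0.5 * (s(delta) + s(-delta)) - target)
    return single, identical, complementary
```

For a smooth signal, the first-order terms of the Taylor expansion around $l_x = 0$ cancel in the complementary average, so its error shrinks with $\Delta_1^2$ rather than with $|\Delta_1|$.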

Fig. 3.3 depicts the rate difference $\Delta R$ for multihypothesis
motion-compensated prediction over the displacement inaccuracy for statistically
independent displacement errors according to [11]. The rate difference according
to (2.42) represents the maximum bit-rate reduction (in bits/sample)
possible by optimal encoding of the prediction error $e$, compared to optimum
intra-frame encoding of the signal $s$ for Gaussian wide-sense stationary signals
for the same mean square reconstruction error. A negative rate difference
$\Delta R$ corresponds to a reduced bit-rate compared to optimum intra-frame coding.
The horizontal axis in Fig. 3.3 is calibrated by $\beta = \log_2(\sqrt{12} \, \sigma_\Delta)$,
where $\beta = 0$ for integer-pel accuracy, $\beta = -1$ for half-pel accuracy,
$\beta = -2$ for quarter-pel accuracy, etc. [11]. The displacement error variance is
given by (2.34).

Figure 3.3. Rate difference for superimposed MCP over the displacement inaccuracy
$\beta$ for statistically independent displacement errors. The hypotheses are averaged
and no residual noise is assumed. 

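We observe in Fig. 3.3 that for the case $N = 1$ the slope reaches 1 bit per
sample per inaccuracy step. This can also be observed in (3.9) for $N = 1$ when
we apply a Taylor series expansion of first order for the function $P$.

$$\frac{\Phi_{ee}(\omega)}{\Phi_{ss}(\omega)} \approx \sigma_\Delta^2 \, \omega^T \omega \quad \text{for } \sigma_\Delta^2 \to 0, \; N = 1 \tag{3.17}$$

Inserting this result in (2.42) supports the observation in Fig. 3.3

$$\Delta R \approx \beta + \text{const.} \quad \text{for } \sigma_\Delta^2 \to 0, \; N = 1. \tag{3.18}$$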



We also observe in Fig. 3.3 that doubling the number of hypotheses decreases
the bit-rate up to 0.5 bits per sample and the slope reaches up to 1 bit
per sample per inaccuracy step. The case $N \to \infty$ achieves a slope up to 2 bits
per sample per inaccuracy step. This can also be observed in (3.9) for $N \to \infty$
when we apply a Taylor series expansion of second order for the function $P$

$$\frac{\Phi_{ee}(\omega)}{\Phi_{ss}(\omega)} \approx \frac{1}{4} \sigma_\Delta^4 \left( \omega^T \omega \right)^2 \quad \text{for } \sigma_\Delta^2 \to 0, \; N \to \infty, \; \rho_\Delta = 0. \tag{3.19}$$

Inserting this result in (2.42) supports the observation in Fig. 3.3

$$\Delta R \approx 2\beta + \text{const.} \quad \text{for } \sigma_\Delta^2 \to 0, \; N \to \infty, \; \rho_\Delta = 0. \tag{3.20}$$

Figure 3.4. Rate difference for superimposed MCP over the displacement inaccuracy
$\beta$ for optimized displacement error correlation. The hypotheses are averaged and
no residual noise is assumed. 



Fig. 3.4 depicts the rate difference for superimposed motion-compensated
prediction over the displacement inaccuracy $\beta$ for optimized displacement
error correlation according to (3.16). We observe for accurate motion compensation
that the slope of the rate difference of 2 bits per sample per inaccuracy
step is already reached for $N = 2$. For increasing number of hypotheses the
rate difference converges to the case $N \to \infty$ at constant slope. This can also
be observed in (3.9) when the displacement error correlation coefficient is set
to $\rho_\Delta = \frac{1}{1-N}$ and a Taylor series expansion of second order for the
function $P(\omega)$ is applied.

$$\frac{\Phi_{ee}(\omega)}{\Phi_{ss}(\omega)} \approx \frac{1}{4} \sigma_\Delta^4 \left( \omega^T \omega \right)^2 \frac{N+1}{N-1} \quad \text{for } \sigma_\Delta^2 \to 0, \; N = 2, 3, \ldots \tag{3.21}$$

Inserting this result in (2.42) supports the observation in Fig. 3.4 that for
$N = 2, 3, \ldots$ the slope reaches up to 2 bits per sample per inaccuracy step.

$$\Delta R \approx 2\beta + \frac{1}{2} \log_2 \left( \frac{N+1}{N-1} \right) + \text{const.} \quad \text{for } \sigma_\Delta^2 \to 0, \; N = 2, 3, \ldots \tag{3.22}$$
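
The slopes of 1 and 2 bits per sample per inaccuracy step can be checked numerically by inserting (3.9) into (2.42) and comparing two neighboring values of $\beta$. The sketch below is a rough check under the assumptions of this section (averaging filter, no residual noise, Gaussian displacement errors with $\sigma_\Delta^2 = 2^{2\beta}/12$); the midpoint grid and the function names are illustrative choices.

```python
import math

def ratio(w2, n, var, rho):
    """(3.9), rewritten with expm1 to stay accurate for small variances."""
    x = 0.5 * w2 * var
    return -2.0 * math.expm1(-x) + (n - 1.0) / n * math.expm1(-2.0 * x * (1.0 - rho))

def rate_difference(n, beta, rho, grid=32):
    """(2.42) evaluated with a midpoint rule; beta = log2(sqrt(12) * sigma)."""
    var = 2.0 ** (2 * beta) / 12.0
    h = 2.0 * math.pi / grid
    total = 0.0
    for i in range(grid):
        wx = -math.pi + (i + 0.5) * h
        for k in range(grid):
            wy = -math.pi + (k + 0.5) * h
            total += math.log2(ratio(wx * wx + wy * wy, n, var, rho))
    return total * h * h / (8.0 * math.pi ** 2)

def slope(n, rho, beta=-4):
    """Change of the rate difference per inaccuracy step at accuracy beta."""
    return rate_difference(n, beta, rho) - rate_difference(n, beta - 1, rho)
```

Here `slope(1, 0.0)` approximates the single-hypothesis case, and `slope(2, -1.0)` the case of two complementary hypotheses with $\rho_\Delta = 1/(1-N) = -1$.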



For very accurate motion compensation $\sigma_\Delta^2 \to 0$ and $N = 2, 3, \ldots$,
doubling the number of hypotheses results in a rate difference of

$$\Delta R_2 := \Delta R(2N) - \Delta R(N) \approx \frac{1}{2} \log_2 \left( \frac{2N^2 - N - 1}{2N^2 + N - 1} \right). \tag{3.23}$$
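
The closed form (3.23) follows from (3.22), because the terms $2\beta$ and the constant cancel in the difference and $\frac{(2N+1)(N-1)}{(2N-1)(N+1)} = \frac{2N^2 - N - 1}{2N^2 + N - 1}$. A short sketch with illustrative function names makes this explicit.

```python
import math

def offset(n):
    """N-dependent term of (3.22): 0.5 * log2((N + 1) / (N - 1)), N >= 2."""
    return 0.5 * math.log2((n + 1.0) / (n - 1.0))

def delta_r2(n):
    """Rate difference for doubling the number of hypotheses, as in (3.23)."""
    return 0.5 * math.log2((2.0 * n * n - n - 1.0) / (2.0 * n * n + n - 1.0))
```

The value is negative for every $N \ge 2$, so doubling still lowers the rate, but its magnitude shrinks toward zero as $N$ grows.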







For a very large number of hypotheses, the rate difference for doubling the
number of hypotheses $\Delta R_2$ converges to zero. Consequently, the prediction
gain by optimum noiseless multihypothesis prediction with averaging filter is
limited and the rate difference converges to the case $N \to \infty$ at constant slope. 

We obtain for the band-limited frame signal the following result: the gain of 
superimposed motion-compensated prediction with jointly optimal motion es- 
timation over motion-compensated prediction increases by improving the ac- 
curacy of motion compensation for each hypothesis. The theoretical results 
suggest that a practical video coding algorithm should utilize two jointly esti- 
mated hypotheses. Experimental results also suggest that the gain by superim- 
posed prediction is limited and that two jointly estimated hypotheses provide a 
major portion of this achievable gain. 

3.2.3 Gradient of the Prediction Error Variance 

We observe in Fig. 3.1 that the prediction error variance $\sigma_e^2$ increases
monotonically for increasing displacement error correlation coefficient $\rho_\Delta$.
In the following, we investigate this in more detail and show that this dependency is
independent of a particular frame signal model, i.e., a particular frame
autocorrelation function. 

Again, let $s(l)$ and $c_\mu(l)$ be generated by a jointly wide-sense stationary,
two-dimensional random process. The vector $l = (x, y)^T$ denotes the location
in $\mathbb{R}^2$. Let $c_\mu$ be the $\mu$-th of $N$ motion-compensated signals available
for prediction, and $s$ be the current frame to be predicted. The limited accuracy
of motion compensation is captured by associating a vector-valued displacement
error $\Delta_\mu$ with the $\mu$-th hypothesis $c_\mu$. For simplicity, we assume that
all hypotheses are shifted versions of the current frame signal $s$. The shift is
determined by the vector-valued displacement error $\Delta_\mu$ of the $\mu$-th hypothesis
such that $c_\mu(l) = s(l - \Delta_\mu)$. 

The superimposed motion-compensated predictor forms a prediction signal
by averaging $N$ hypotheses $c_\mu(l)$ in order to predict the current frame signal
$s(l)$. The prediction error at location $l$ is the difference between the current
frame signal and $N$ averaged hypotheses

$$e(l) = s(l) - \frac{1}{N} \sum_{\mu=1}^{N} s(l - \Delta_\mu). \tag{3.24}$$

As we assume wide-sense stationary signals, the prediction error variance
$\sigma_e^2 = E\{e^2(l)\}$ is independent of the location $l$

$$\sigma_e^2 = \phi_{ss}(0) - \frac{2}{N} \sum_{\mu=1}^{N} E\{\phi_{ss}(\Delta_\mu)\} + \frac{1}{N^2} \sum_{\mu=1}^{N} \sum_{\nu=1}^{N} E\{\phi_{ss}(\Delta_\mu - \Delta_\nu)\}, \tag{3.25}$$

but is dependent on the scalar space-continuous autocorrelation function

$$\phi_{ss}(l) = E\{s(l_0 + l) \, s^*(l_0)\}. \tag{3.26}$$

The autocorrelation function of a wide-sense stationary random process does
not depend on the absolute location $l_0$ but only on the relative two-dimensional
shift $l$. 

For the $\mu$-th displacement error $\Delta_\mu$, a 2-D stationary normal distribution
with variance $\sigma_\Delta^2$ and zero mean is assumed where the x- and y-components
are statistically independent. The displacement error variance is the same for
all $N$ hypotheses. Further, the pairs $(\Delta_\mu, \Delta_\nu)$ are assumed to be jointly
Gaussian random variables. The correlation coefficient $\rho_\Delta$ between two
displacement error components $\Delta_{x\mu}$ and $\Delta_{x\nu}$ is the same for all pairs
of hypotheses. With these assumptions, the covariance matrix of a displacement
error component is given by (3.5). Since the covariance matrix is nonnegative
definite, the correlation coefficient $\rho_\Delta$ has the limited range according to (3.6). 

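For the expected values in (3.25), we define a function $g(\omega_0, \sigma_\Delta^2)$ that
is only dependent on the spatial correlation of the frame signal
$\rho_s = \exp(-\omega_0)$ and the displacement error variance $\sigma_\Delta^2$. Further, we
exploit the fact that the difference of two jointly Gaussian random variables is
also Gaussian.

$$E\{\phi_{ss}(\Delta_\mu)\} = \sigma_s^2 \, g(\omega_0, \sigma_\Delta^2) \quad \text{for } \mu = 1, 2, \ldots, N \tag{3.27}$$

$$E\{\phi_{ss}(\Delta_\mu - \Delta_\nu)\} = \sigma_s^2 \, g\left(\omega_0, 2\sigma_\Delta^2(1 - \rho_\Delta)\right) \quad \text{for } \mu \ne \nu \tag{3.28}$$

With (3.27) and (3.28), the prediction error variance in (3.25) can be normalized
to the frame signal variance $\sigma_s^2$. Note that for $\mu = \nu$ the expected value
$E\{\phi_{ss}(\Delta_\mu - \Delta_\nu)\}$ is equal to the variance of the frame signal $\sigma_s^2$.

$$\frac{\sigma_e^2}{\sigma_s^2} = \frac{N+1}{N} - 2 g(\omega_0, \sigma_\Delta^2) + \frac{N-1}{N} g\left(\omega_0, 2\sigma_\Delta^2(1 - \rho_\Delta)\right) \tag{3.29}$$

for

$$\frac{1}{1-N} \le \rho_\Delta \le 1. \tag{3.30}$$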

With the function $g$ in (3.27), we can show that for $N > 1$ the prediction
error variance $\sigma_e^2$ increases monotonically with the displacement error
correlation coefficient $\rho_\Delta$. We know for the single hypothesis predictor that the
prediction error variance increases monotonically with the displacement error
variance $\sigma_\Delta^2$. For $N = 1$, (3.29) implies

$$\frac{\partial \sigma_e^2}{\partial \sigma_\Delta^2} = -2 \sigma_s^2 \, \frac{\partial g(\omega_0, \sigma_\Delta^2)}{\partial \sigma_\Delta^2} > 0, \tag{3.31}$$

that is, the partial derivative of $g$ with respect to $\sigma_\Delta^2$ is negative. Next, we
calculate the partial derivative of the prediction error variance with respect to
the displacement error correlation coefficient $\rho_\Delta$ for $N > 1$ and observe that
this partial derivative is positive.

$$\frac{\partial \sigma_e^2}{\partial \rho_\Delta} = -2 \sigma_\Delta^2 \sigma_s^2 \, \frac{N-1}{N} \, \frac{\partial g\left(\omega_0, 2\sigma_\Delta^2(1 - \rho_\Delta)\right)}{\partial \left( 2\sigma_\Delta^2(1 - \rho_\Delta) \right)} > 0 \tag{3.32}$$

Consequently, the prediction error variance $\sigma_e^2$ increases monotonically with
the displacement error correlation coefficient $\rho_\Delta$ independent of a particular
underlying frame signal model. 
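
The monotonicity can be illustrated for one concrete choice of $g$. The sketch below assumes the isotropic autocorrelation function $\phi_{ss}(l) = \sigma_s^2 \exp(-\omega_0 |l|)$, which is one model consistent with $\rho_s = \exp(-\omega_0)$ but is an assumption made here for illustration. For a 2-D Gaussian displacement error with independent components of variance `var`, the magnitude $|\Delta|$ is Rayleigh distributed, so $g$ becomes a one-dimensional integral that is approximated with a midpoint rule.

```python
import math

def g(omega0, var, steps=2000):
    """E{exp(-omega0 * |Delta|)} for a 2-D Gaussian Delta (assumed signal model)."""
    if var == 0.0:
        return 1.0
    r_max = 10.0 * math.sqrt(var)
    h = r_max / steps
    total = 0.0
    for i in range(steps):
        r = (i + 0.5) * h
        # Rayleigh density of |Delta| times the autocorrelation at distance r.
        total += (r / var) * math.exp(-r * r / (2.0 * var) - omega0 * r)
    return total * h

def normalized_variance(n, omega0, var, rho):
    """sigma_e^2 / sigma_s^2 according to (3.29)."""
    return ((n + 1.0) / n
            - 2.0 * g(omega0, var)
            + (n - 1.0) / n * g(omega0, 2.0 * var * (1.0 - rho)))
```

Since $g$ decreases with its variance argument, a larger $\rho_\Delta$ reduces the variance $2\sigma_\Delta^2(1 - \rho_\Delta)$ of the displacement error difference, raises the last term of (3.29), and therefore raises the prediction error variance.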



3.3 Hypotheses with Additive Noise 

To consider signal components that cannot be modeled by motion compensation,
statistically independent noise $n_\mu$ is added to each motion-compensated
signal. Further, we assume that the current frame $s$ originates from a noise-free
model video signal $v$ but is also characterized by statistically independent
additive Gaussian noise $n_0$ [11]. We characterize the residual noise $n_\mu$ for each
hypothesis by the power spectral density $\Phi_{n_\mu n_\mu}(\omega)$ and the residual
noise $n_0$ in the current frame by $\Phi_{n_0 n_0}(\omega)$. For convenience, we normalize
the noise power spectra with respect to the power spectral density of the model
video signal $\Phi_{vv}(\omega)$.

$$\alpha_\mu(\omega) = \frac{\Phi_{n_\mu n_\mu}(\omega)}{\Phi_{vv}(\omega)} \quad \text{for } \mu = 0, 1, 2, \ldots, N \tag{3.33}$$



Fig. 3.5 depicts the model for motion-compensated signals with statistically
independent additive noise and linear filter. All hypotheses are jointly filtered
to determine the final prediction signal. The linear filter is described by the
vector-valued transfer function $F(\omega)$. In particular, $F(\omega)$ is a row vector with
$N$ scalar transfer functions. The power spectrum of the prediction error with
the linear filter is

$$\Phi_{ee}(\omega) = \Phi_{ss}(\omega) - \Phi_{sc}(\omega) F^H(\omega) - F(\omega) \Phi_{cs}(\omega) + F(\omega) \Phi_{cc}(\omega) F^H(\omega). \tag{3.34}$$

Figure 3.5. Two-hypothesis motion-compensated prediction with linear filter. 

In the following, we investigate the performance of motion compensation 
with complementary hypotheses for both the averaging filter and a Wiener fil- 
ter. 



3.3.1 Averaging Filter 

The averaging filter weights each hypothesis equally with the constant factor
$1/N$. We use the notation of the column vector $\mathbf{1}$ denoting that all entries are
equal to one.

$$F(\omega) = \frac{1}{N}\,\mathbf{1}^T \tag{3.35}$$

We evaluate the performance of motion compensation with complementary
hypotheses for the averaging filter in (3.35) by calculating the rate difference
in (2.42). In order to do this, we normalize the power spectral density of the
prediction error and substitute the power spectra in (3.34) with (3.3), (3.4),
(3.7) and (3.8). Further, we assume individual power spectral densities for the
residual noise.

$$\frac{\Phi_{ee}(\omega)}{\Phi_{ss}(\omega)} = 1 + \frac{N + \sum_{\mu=1}^{N}\alpha_\mu(\omega)}{N^2\,\big(1+\alpha_0(\omega)\big)} - 2\,\frac{P(\omega,\sigma_\Delta^2)}{1+\alpha_0(\omega)} + \frac{N-1}{N}\,\frac{P\big(\omega, 2\sigma_\Delta^2(1-\rho_\Delta)\big)}{1+\alpha_0(\omega)} \tag{3.36}$$



Experimental results for the predictor design in [15, 16] show that all N
hypotheses contribute equally well. Based on this observation, we assume that the
noise power spectral densities are identical for all N hypotheses. To simplify
the model, we assume also that they are identical to the noise power spectrum
of the current frame, i.e., $\alpha_\mu(\omega) = \alpha_0(\omega)$ for $\mu = 1, 2, \ldots, N$. With these
assumptions, the normalized prediction error power spectrum for the
superimposed predictor with averaging filter reads

$$\frac{\Phi_{ee}(\omega)}{\Phi_{ss}(\omega)} = \frac{N+1}{N} - 2\,\frac{P(\omega,\sigma_\Delta^2)}{1+\alpha_0(\omega)} + \frac{N-1}{N}\,\frac{P\big(\omega, 2\sigma_\Delta^2(1-\rho_\Delta)\big)}{1+\alpha_0(\omega)}. \tag{3.37}$$

If there is no noise, i.e., $\alpha_0(\omega) = 0$, we obtain the result in (3.9).
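The normalized spectrum in (3.37) can be evaluated numerically. The following is a minimal sketch, not the authors' implementation. It assumes a Gaussian characteristic function $P(\omega, \sigma^2) = \exp(-\sigma^2 \omega^T\omega / 2)$, a frequency-independent $\alpha_0$, the mapping $\sigma_\Delta^2 = 2^{2\beta}/12$, and that the rate difference (2.42) has the form $\frac{1}{8\pi^2}\iint \log_2(\Phi_{ee}/\Phi_{ss})\,d\omega$ over $[-\pi, \pi]^2$. All function names are hypothetical.

```python
import math


def char_fn(wx, wy, var):
    """Assumed Gaussian characteristic function P(w, var) = exp(-var * |w|^2 / 2)."""
    return math.exp(-0.5 * var * (wx * wx + wy * wy))


def psd_ratio_averaging(wx, wy, n_hyp, var_d, rho, alpha0=0.0):
    """Normalized prediction error spectrum of (3.37) at frequency (wx, wy)."""
    p = char_fn(wx, wy, var_d)
    p_rho = char_fn(wx, wy, 2.0 * var_d * (1.0 - rho))
    return ((n_hyp + 1) / n_hyp
            - 2.0 * p / (1.0 + alpha0)
            + (n_hyp - 1) / n_hyp * p_rho / (1.0 + alpha0))


def rate_difference(n_hyp, beta, alpha0=0.0, grid=64):
    """Midpoint-rule estimate of the assumed rate difference integral (bits/sample)."""
    var_d = 2.0 ** (2.0 * beta) / 12.0                 # assumed mapping from beta
    rho = 1.0 / (1.0 - n_hyp) if n_hyp > 1 else 0.0    # complementary hypotheses
    step = 2.0 * math.pi / grid
    total = 0.0
    for i in range(grid):
        wx = -math.pi + (i + 0.5) * step               # midpoints avoid w = 0
        for j in range(grid):
            wy = -math.pi + (j + 0.5) * step
            total += math.log2(psd_ratio_averaging(wx, wy, n_hyp, var_d, rho, alpha0))
    return total * step * step / (8.0 * math.pi ** 2)


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, round(rate_difference(n, beta=-1.0), 3))
```

With `alpha0=0.0` the function reduces to the noiseless case, so it can also be used to compare noise-free and noisy settings under the stated assumptions.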




Figure 3.6. Rate difference for motion compensation with complementary hypotheses and
averaging filter over the displacement inaccuracy $\beta$. Residual noise level RNL = -100 dB.

Figs. 3.6 and 3.7 depict the rate difference for motion compensation with
complementary hypotheses over the displacement inaccuracy $\beta$ at a residual
noise level of -100 dB and -30 dB, respectively. The residual noise level is
defined by RNL = $10 \log_{10}(\sigma_n^2)$ where $\sigma_n^2$ is the residual noise variance. As
suggested in [11], we assume a constant power spectral density for the residual
noise. For the plotted range of the motion inaccuracy, a residual noise level
of RNL = -100 dB indicates that the residual noise is negligible and the
performance is similar to the noiseless case as shown in Fig. 3.4. At a residual
noise level of RNL = -30 dB, the rate difference saturates beyond 1/8-th pel
accuracy ($\beta = -3$). This noise level is closer to practical conditions. We
observe for motion compensation with complementary hypotheses that the rate
difference saturates even at quarter-pel accuracy. Consequently, we can achieve
similar prediction performance at lower compensation accuracy when utilizing
motion compensation with complementary hypotheses. Regardless of the accuracy
of superimposed motion compensation, the rate difference improves for an
increasing number of hypotheses due to the noise suppression by the averaging
filter.

Figure 3.7. Rate difference for motion compensation with complementary hypotheses and
averaging filter over the displacement inaccuracy $\beta$. Residual noise level RNL = -30 dB.

3.3.2 Wiener Filter 

The optimum Wiener filter minimizes the power spectral density of the
prediction error. This is a well-known result [205] and we adopt the expression
for multihypothesis motion-compensation from [11]. The vector-valued
optimum filter $F_o(\omega)$ is the product of the Hermitian conjugate of the cross
spectrum and the inverse power spectral density matrix of the hypotheses

$$F_o(\omega) = \Phi_{cs}^H(\omega)\,\Phi_{cc}^{-1}(\omega). \tag{3.38}$$

This expression minimizes the power spectral density of the prediction error
and its minimum is given by

$$\Phi_{ee}(\omega) = \Phi_{ss}(\omega) - \Phi_{cs}^H(\omega)\,\Phi_{cc}^{-1}(\omega)\,\Phi_{cs}(\omega). \tag{3.39}$$

To evaluate the performance of the predictor with optimum Wiener filter, we
have to specify the vector-valued cross spectrum $\Phi_{cs}(\omega)$ and the power spectral
density matrix of the hypotheses $\Phi_{cc}(\omega)$. In the following, we analyze the
influence of motion compensation with complementary hypotheses on both
the Wiener filter and its prediction performance.

For motion compensation with complementary hypotheses and (3.7), the
vector of the cross spectra is simply

$$\Phi_{cs}(\omega) = \mathbf{1}\,P(\omega, \sigma_\Delta^2)\,\Phi_{vv}(\omega). \tag{3.40}$$

The cross spectra do not include the power spectral densities of the residual
noise as we assume that the individual noise signals are mutually statistically
independent. The same argument holds for the non-diagonal entries in the
matrix $\Phi_{cc}$ which are determined by the characteristic function of the
displacement error PDF with the displacement error correlation coefficient $\rho_\Delta$
according to (3.8). The diagonal entries in the matrix $\Phi_{cc}$ are characterized by the
power spectral densities of the hypotheses which include the power spectral
densities of the residual noise. We write the matrix $\Phi_{cc}$ as the sum of the
matrix $\mathbf{1}\mathbf{1}^T$ and the diagonal matrix diag(·) as this representation is useful for the
following discussion. $\mathbf{1}\mathbf{1}^T$ is the N x N matrix with all entries equal to one.
With that, the power spectral density matrix for N hypotheses becomes

$$\Phi_{cc}(\omega) = \left[\mathbf{1}\mathbf{1}^T + \operatorname{diag}\!\left(\frac{1 + \alpha_i(\omega) - P_\rho(\omega)}{P_\rho(\omega)}\right)\right] P_\rho(\omega)\,\Phi_{vv}(\omega) \tag{3.41}$$



where $P_\rho(\omega) = P\big(\omega, 2\sigma_\Delta^2(1-\rho_\Delta)\big)$ abbreviates the characteristic function of
the displacement error PDF with the displacement error correlation coefficient
$\rho_\Delta$. $\alpha_i(\omega)$ represents the normalized power spectral density of the residual
noise in the i-th hypothesis.

The Wiener solution requires the inverse of the power spectral density
matrix $\Phi_{cc}(\omega)$. An analytical expression is derived in Appendix A.3 and the
optimum Wiener filter according to (3.38) yields

$$F_o(\omega) = \frac{1}{\mathbf{1}^T\mathbf{b}(\omega) - 1}\,\frac{P(\omega,\sigma_\Delta^2)}{P_\rho(\omega)}\,\mathbf{b}^T(\omega) \tag{3.42}$$



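with the vector

$$\mathbf{b}^T(\omega) = \left(\frac{-P_\rho(\omega)}{1+\alpha_1(\omega)-P_\rho(\omega)},\ \frac{-P_\rho(\omega)}{1+\alpha_2(\omega)-P_\rho(\omega)},\ \ldots,\ \frac{-P_\rho(\omega)}{1+\alpha_N(\omega)-P_\rho(\omega)}\right). \tag{3.43}$$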

Please note that only the normalized noise spectrum $\alpha_\mu(\omega)$ differs in the
components of $\mathbf{b}(\omega)$ and that the contribution from inaccurate motion
compensation is the same in each component. The expression for the power spectral
density of the prediction error in (3.39) incorporates explicitly the solution for
the Wiener filter in (3.38) such that $\Phi_{ee} = \Phi_{ss} - F_o\Phi_{cs}$. After normalization
with $\Phi_{ss}(\omega) = \big(1+\alpha_0(\omega)\big)\Phi_{vv}(\omega)$, the power spectral density of the prediction
error reads

$$\frac{\Phi_{ee}(\omega)}{\Phi_{ss}(\omega)} = 1 - \frac{1}{1+\alpha_0(\omega)}\,\frac{\mathbf{1}^T\mathbf{b}(\omega)}{\mathbf{1}^T\mathbf{b}(\omega)-1}\,\frac{P^2(\omega,\sigma_\Delta^2)}{P_\rho(\omega)}. \tag{3.44}$$

As a reminder, $\alpha_0(\omega)$ denotes the normalized power spectral density of the
residual noise in the current frame.

If the noise energy in all the hypotheses and the current frame is the same,
that is, $\alpha_\mu(\omega) = \alpha(\omega)$ for $\mu = 0, 1, \ldots, N$, we obtain for the optimum Wiener
filter

$$F_o(\omega) = \frac{P(\omega,\sigma_\Delta^2)}{1+\alpha(\omega)+(N-1)P_\rho(\omega)}\,\mathbf{1}^T, \tag{3.45}$$

and for the normalized power spectral density of the prediction error

$$\frac{\Phi_{ee}(\omega)}{\Phi_{ss}(\omega)} = 1 - \frac{1}{1+\alpha(\omega)}\,\frac{N P^2(\omega,\sigma_\Delta^2)}{1+\alpha(\omega)+(N-1)P_\rho(\omega)}. \tag{3.46}$$



Based on this result, we investigate the influence of motion compensation with 
complementary hypotheses on both the Wiener filter and its prediction perfor- 
mance. 
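The closed forms (3.42) to (3.46) can be cross-checked against a direct solution of (3.38) and (3.39) at a single frequency. The sketch below is an illustration under stated assumptions: all spectra are real, $\Phi_{vv}$ is normalized to one, and the helper names `solve`, `wiener_numeric`, and `wiener_closed_form` are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix copy
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def wiener_numeric(p, p_rho, alphas, alpha0):
    """Filter and normalized error spectrum from (3.38)-(3.41), Phi_vv = 1."""
    n = len(alphas)
    phi_cs = [p] * n                                   # (3.40)
    phi_cc = [[1.0 + alphas[i] if i == j else p_rho    # (3.41), entry by entry
               for j in range(n)] for i in range(n)]
    f = solve(phi_cc, phi_cs)    # real symmetric case: F_o^T = Phi_cc^{-1} Phi_cs
    ratio = 1.0 - sum(fi * ci for fi, ci in zip(f, phi_cs)) / (1.0 + alpha0)
    return f, ratio


def wiener_closed_form(p, p_rho, alphas, alpha0):
    """Filter and normalized error spectrum from (3.42)-(3.44)."""
    b = [-p_rho / (1.0 + a - p_rho) for a in alphas]   # (3.43)
    sb = sum(b)                                        # 1^T b
    f = [p / p_rho * bi / (sb - 1.0) for bi in b]      # (3.42)
    ratio = 1.0 - sb / (sb - 1.0) * p * p / p_rho / (1.0 + alpha0)  # (3.44)
    return f, ratio


if __name__ == "__main__":
    p, rho = 0.9, -0.5                   # N = 3 complementary hypotheses
    p_rho = p ** (2.0 * (1.0 - rho))     # holds for an assumed Gaussian P
    print(wiener_numeric(p, p_rho, [0.01, 0.05, 0.2], 0.01))
    print(wiener_closed_form(p, p_rho, [0.01, 0.05, 0.2], 0.01))
```

Setting all entries of `alphas` equal to `alpha0` reproduces the special case (3.45) and (3.46).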




Figure 3.8. Rate difference for motion compensation with complementary hypotheses and
Wiener filter over the displacement inaccuracy $\beta$. Residual noise level RNL = -100 dB.

Fig. 3.8 depicts the rate difference for motion compensation with
complementary hypotheses and Wiener filter over the displacement inaccuracy $\beta$
and negligible residual noise. For $N > 1$, the graphs show that doubling
the number of complementary hypotheses decreases the bit-rate at least by
0.5 bits per sample and the slope reaches up to 2 bits per sample per
inaccuracy step. This can also be observed in (3.46) when the residual noise is
neglected, i.e., $\alpha(\omega) = 0$, the displacement error correlation coefficient is set
to $\rho_\Delta = \frac{1}{1-N}$, and a Taylor series expansion of second order is applied for the
function $P(\omega, \sigma_\Delta^2)$.

$$\frac{\Phi_{ee}(\omega)}{\Phi_{ss}(\omega)} \approx \frac{(\sigma_\Delta^2\,\omega^T\omega)^2}{2\,(N-1)} \quad \text{for } \sigma_\Delta^2 \to 0 \tag{3.47}$$

Inserting this result into (2.42) supports the observation in Fig. 3.8 that the slope
reaches up to 2 bits per sample per inaccuracy step

$$\Delta R \approx 2\beta + \frac{1}{2}\log_2\!\left(\frac{1}{N-1}\right) + \text{const.} \quad \text{for } \sigma_\Delta^2 \to 0. \tag{3.48}$$

For very accurate motion compensation, doubling the number of hypotheses
results in a rate difference of

$$\Delta R_2 := \Delta R(2N) - \Delta R(N) \approx \frac{1}{2}\log_2\!\left(\frac{N-1}{2N-1}\right) \quad \text{for } \sigma_\Delta^2 \to 0. \tag{3.49}$$

For a very large number of hypotheses, the rate difference for doubling the
number of hypotheses $\Delta R_2$ converges to -0.5 bits per sample. Consequently,
the prediction error variance of the optimal noiseless superimposed predictor
with Wiener filter converges to zero for an infinite number of hypotheses.
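The second-order approximation and the doubling gain can be checked numerically. This is a small sketch that assumes the Gaussian characteristic function $P(\omega, \sigma^2) = \exp(-\sigma^2\omega^T\omega/2)$ and uses the shorthand $x = \sigma_\Delta^2\,\omega^T\omega$. The function names are hypothetical.

```python
import math


def exact_ratio(x, n_hyp):
    """(3.46) with alpha = 0 and rho = 1/(1-N); x = var * |w|^2, assumed Gaussian P."""
    p_sq = math.exp(-x)                          # P^2(w, var)
    p_rho = math.exp(-x * n_hyp / (n_hyp - 1))   # P(w, 2 var N / (N - 1))
    return 1.0 - n_hyp * p_sq / (1.0 + (n_hyp - 1) * p_rho)


def approx_ratio(x, n_hyp):
    """Second-order Taylor approximation of (3.47)."""
    return x * x / (2.0 * (n_hyp - 1))


def doubling_gain(n_hyp):
    """Rate difference Delta R_2 of (3.49) in bits per sample."""
    return 0.5 * math.log2((n_hyp - 1) / (2 * n_hyp - 1))


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, exact_ratio(1e-3, n) / approx_ratio(1e-3, n), doubling_gain(n))
```

For small `x` the ratio of exact to approximate value approaches one, and `doubling_gain` moves from about -0.79 bits for N = 2 towards -0.5 bits for large N.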




Figure 3.9. Rate difference for motion compensation with complementary hypotheses and
Wiener filter over the displacement inaccuracy $\beta$. Residual noise level RNL = -30 dB.

Fig. 3.9 shows the rate difference for motion compensation with
complementary hypotheses and Wiener filter over the displacement inaccuracy $\beta$ at a
residual noise level of -30 dB. Similar to the averaging filter for motion
compensation with complementary hypotheses in Fig. 3.7, the rate difference
saturates even at quarter-pel accuracy. For very accurate superimposed
motion compensation, the rate difference improves for an increasing number
of hypotheses due to the noise suppression by the Wiener filter but saturates for
$N \to \infty$. In contrast to the averaging filter, the Wiener filter with a very large
number of hypotheses is able to eliminate the influence of motion inaccuracy.
This can be observed in (3.46) for $N \to \infty$. In this case, the normalized power
spectral density of the prediction error yields

$$\frac{\Phi_{ee}(\omega)}{\Phi_{ss}(\omega)} \to 1 - \frac{1}{1+\alpha(\omega)} \quad \text{for } N \to \infty. \tag{3.50}$$




Figure 3.10. Rate difference for superimposed motion-compensated prediction with Wiener
filter over the displacement inaccuracy $\beta$ for both optimized displacement error correlation and
statistically independent displacement error. Residual noise level RNL = -30 dB. In all cases,
the optimum filter is applied.

Fig. 3.10 depicts the rate difference for superimposed motion-compensated
prediction over the displacement inaccuracy $\beta$ for both optimized displacement
error correlation and statistically independent displacement error. The residual
noise level is chosen to be -30 dB. For half-pel accurate motion compensation
and 2 hypotheses, we gain about 0.6 bits/sample in rate difference for motion
compensation with complementary hypotheses over statistically independent
displacement error. This corresponds to a prediction gain of about 3.6 dB.
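The conversion from 0.6 bits/sample to about 3.6 dB can be reproduced with the usual high-rate relation, in which a rate difference of $\Delta R$ bits per sample corresponds to a variance ratio of $2^{2\Delta R}$. The sketch below assumes that this relation is the one intended here, and the function name is hypothetical.

```python
import math


def prediction_gain_db(rate_gain_bits):
    """High-rate conversion: 1 bit/sample corresponds to 20*log10(2), about 6.02 dB."""
    return 20.0 * math.log10(2.0) * rate_gain_bits


if __name__ == "__main__":
    print(round(prediction_gain_db(0.6), 2))
```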








So far, we investigated the prediction performance achieved by the optimum
Wiener filter for motion compensation with complementary hypotheses.
In the following, we discuss the transfer function of the optimum Wiener
filter according to (3.45) for motion compensation with $N = 2$ complementary
hypotheses and compare it to that of the optimum Wiener filter for motion
compensation with $N = 1$ hypothesis.




Figure 3.11. Transfer function of the Wiener filter for $N = 1$, $\beta = 0$, and no residual noise.




Figure 3.12. Transfer function of a component of the Wiener filter for $N = 2$, $\beta = 0$, and no
residual noise.

Figs. 3.11 and 3.12 depict the transfer function of one component of the
optimum Wiener filter according to (3.45) for $N = 1$ and $N = 2$, respectively. In
both cases, we neglect the residual noise and use integer-pel accurate motion
compensation. The transfer function of the filter for the superimposed
predictor is flatter and, hence, shows significantly less spectral selectivity when
compared to the case of the single hypothesis predictor. In addition, the
transfer function for motion compensation with complementary hypotheses seems
to suppress low frequency components. We will investigate this further by
means of cross sections of the transfer functions.




Figure 3.13. Cross section at $\omega_y = 0$ of the transfer function of the Wiener filter for $\beta = 0$
and no residual noise. The one-hypothesis filter $N = 1$ is compared to the two-hypothesis filter
$N = 2$ with optimized displacement error correlation coefficient $\rho_\Delta = -1$ and uncorrelated
displacement error $\rho_\Delta = 0$.

Figs. 3.13 and 3.14 show cross sections at $\omega_y = 0$ of the transfer functions
of the optimum Wiener filter according to (3.45) for $\beta = 0$ and $\beta = 1$,
respectively. The one-hypothesis filter $N = 1$ is compared to the two-hypothesis filter
$N = 2$ with optimized displacement error correlation coefficient $\rho_\Delta = -1$ and
uncorrelated displacement error $\rho_\Delta = 0$. The spectral selectivity decreases for
more accurate motion compensation. This allows us to accurately compensate
high frequency components. Interestingly, the filter for motion compensation
with complementary hypotheses amplifies high frequency components when
compared to the two-hypothesis filter with uncorrelated displacement error. In
other words, the optimum Wiener filter for motion compensation with
complementary hypotheses promotes high frequency components that are otherwise
suppressed by the two-hypothesis filter with uncorrelated displacement error.
These effects grow larger for less accurate motion compensation.

Figure 3.14. Cross section at $\omega_y = 0$ of the transfer function of the Wiener filter for $\beta = 1$
and no residual noise. The one-hypothesis filter $N = 1$ is compared to the two-hypothesis filter
$N = 2$ with optimized displacement error correlation coefficient $\rho_\Delta = -1$ and uncorrelated
displacement error $\rho_\Delta = 0$.

As previously mentioned, very accurate motion compensation flattens the
characteristic function of the displacement error PDF, that is, $P(\omega, \sigma_\Delta^2) \to 1$
for $\sigma_\Delta^2 \to 0$. In this case, the optimum Wiener filter in (3.45) depends mainly
on the power spectral density of the residual noise according to

$$F_o(\omega) = \frac{1}{N + \alpha(\omega)}\,\mathbf{1}^T. \tag{3.51}$$



If we neglect the residual noise in (3.51), i.e., $\alpha(\omega) \to 0$, we obtain the
averaging filter according to (3.35). In other words, considering only signal
components that capture the motion in the model, the averaging filter is, in the
limit, the optimum filter for very accurate motion compensation. Note that
the complementary hypotheses are not identical, even for very accurate motion
compensation.



3.4 Forward-Adaptive Hypothesis Switching

Superimposed motion-compensated prediction combines more than one
motion-compensated signal, or hypothesis, to predict the current frame signal.
So far, we have not specified how the multiple hypotheses are selected. Now,
we assume that they are determined by forward-adaptive hypothesis switching
[206]. Hypothesis switching selects one motion-compensated signal from a
set of M reference frames. The parameter which indicates a particular frame
in the set of M reference frames is transmitted as side information to the
decoder. This models multiframe motion-compensated prediction as discussed
in Section 2.2.4. In the following, we discuss this concept of forward-adaptive
hypothesis switching for superimposed motion-compensated prediction and its
performance with complementary hypotheses.

Assume that we linearly combine N hypotheses. Each hypothesis that is 
used for the combination is selected from a set of motion-compensated signals 
of size M. We study the influence of the hypothesis set size M on both the 
accuracy of motion compensation of forward-adaptive hypothesis switching 
and the efficiency of superimposed motion estimation. In both cases, we ex- 
amine the noise-free limiting case. That is, we neglect signal components that 
are not predictable by motion compensation. Selecting one hypothesis from a 
set of motion-compensated signals of size M , that is, switching among M hy- 
potheses, reduces the displacement error variance by factor M, if we assume 
statistically independent displacement errors. Integrating forward-adaptive hy- 
pothesis switching into superimposed motion-compensated prediction, that is, 
allowing a combination of switched hypotheses, increases the gain of super- 
imposed motion-compensated prediction over the single hypothesis case for 
growing hypothesis set size M. 

Experimental results in Sections 4.3.3 and 5.3.2 confirm that block-based
multiframe motion compensation enhances the efficiency of superimposed pre- 
diction. The experimental setup is such that we combine both block-based pre- 
dictors and superimpose N hypotheses where each hypothesis is obtained by 
switching among M motion-compensated blocks. 

3.4.1 Signal Model for Hypothesis Switching 

A signal model for hypothesis switching is depicted in Fig. 3.15 for two
hypotheses. The current frame signal $s[l]$ at discrete location $l = (x, y)$ is
predicted by selecting between M hypotheses $c_\mu[l]$ with $\mu = 1, \ldots, M$. The
resulting prediction error is denoted by $e[l]$.

To capture the limited accuracy of motion compensation, we associate a
vector-valued displacement error $\Delta_\mu$ with the $\mu$-th hypothesis $c_\mu$. The
displacement error reflects the inaccuracy of the displacement vector used for
motion compensation. We assume a 2-D stationary normal distribution with
variance $\sigma_\Delta^2$ and zero mean where x- and y-components are statistically
independent. The displacement error variance is the same for all M hypotheses.
This is reasonable because all hypotheses are compensated with the same
accuracy. Further, the pairs $(\Delta_\mu, \Delta_\nu)$ are assumed to be statistically independent
Gaussian random variables, and the displacement errors for all hypotheses
are assumed to be spatially constant.

Figure 3.15. Forward-adaptive hypothesis switching for motion-compensated prediction.

For simplicity, we assume that all hypotheses $c_\mu$ are shifted versions of the
current frame signal s. The shift is determined by the displacement error $\Delta_\mu$ of
the $\mu$-th hypothesis. For that, the ideal reconstruction of the space-discrete
signal $s[l]$ is shifted by the continuous-valued displacement error and re-sampled
on the original orthogonal grid.

This model neglects “noisy” signal components and assumes that motion 
accuracy is basically the decision criterion for switching. As previously dis- 
cussed in Section 3.2.3, prediction error variance decreases by reducing the 
displacement error variance of hypotheses that are used for prediction. But 
a smaller displacement error variance can only be achieved by increasing the 
probability of individual displacement errors that are “close” to the origin. In
other words, we select from the set of M motion-compensated signals the one 
with the smallest displacement error. 

3.4.2 Minimizing the Radial Displacement Error 

Hypothesis switching improves the accuracy of motion compensation by 
selecting among M hypotheses the one with the smallest displacement error. 
Now, let us assume that the components of the displacement error for each 
hypothesis are i.i.d. Gaussian. The Euclidean distance to the zero displacement 
error vector defines the radial displacement error for each hypothesis. 

$$\Delta_{r\mu} = \sqrt{\Delta_{x\mu}^2 + \Delta_{y\mu}^2} \tag{3.52}$$

We assume that the hypothesis with minimum radial displacement error

$$\Delta_r^M = \min_\mu\,(\Delta_{r1}, \ldots, \Delta_{r\mu}, \ldots, \Delta_{rM}) \tag{3.53}$$

is used to predict the signal.

In the following, we describe hypothesis switching by means of the
reliability function [205] of the minimum radial displacement error. The reliability
function $R_{\Delta_r^M}(r)$ is closely related to the distribution function and is defined
as the probability of the event that $\Delta_r^M$ is larger than r.

$$R_{\Delta_r^M}(r) = \Pr\{\Delta_r^M > r\} \tag{3.54}$$

The reliability function of the minimum radial displacement error can be
expressed in terms of the reliability function of the set of M hypotheses. The
probability of the event that the minimum radial displacement error is larger
than r is equal to the probability of the event that all radial displacement errors
are larger than r.

$$\begin{aligned} R_{\Delta_r^M}(r) &= \Pr\{\min(\Delta_{r1}, \ldots, \Delta_{r\mu}, \ldots, \Delta_{rM}) > r\} \\ &= \Pr\{\Delta_{r1} > r, \ldots, \Delta_{rM} > r\} \\ &= R_{\Delta_{r1}\cdots\Delta_{rM}}(r, \ldots, r) \end{aligned} \tag{3.55}$$




Figure 3.16. The area in which the minimum of $\Delta_{r1}$ and $\Delta_{r2}$ is larger than r.

Fig. 3.16 depicts an example for switching two hypotheses. It marks the 
area in which the minimum of two radial displacement errors is larger than r. 
Consequently, the probability of the event that the minimum radial displace- 
ment error is larger than r is equal to the probability of the event that both 
radial displacement errors are larger than r. 

Each displacement error is drawn from a 2-D normal distribution with zero
mean and variance $\sigma_\Delta^2$. The displacement errors of the M hypotheses are
assumed to be statistically independent. The x- and y-components of the
displacement errors are arranged to vectors $\Delta_x$ and $\Delta_y$, respectively.

The criterion for switching is the radial displacement error. To obtain a
closed-form expression of the radial displacement error PDF, it is assumed
that the variances in x- and y-direction are identical. This is reasonable if the
accuracy of motion compensation is identical for both dimensions. With this
assumption, we can easily determine the probability density function of $\Delta_r$
[205]:

(3.58)



An M-dimensional Rayleigh PDF is obtained describing M independent radial
displacement errors.

In order to minimize the radial displacement error, the M-dimensional
reliability function of the displacement error has to be determined.

The reliability function of the minimum radial displacement error $R_{\Delta_r^M}(r)$
is obtained by evaluating the M-dimensional reliability function at the same
value r for all dimensions: $R_{\Delta_r^M}(r) = R_{\Delta_{r1}\cdots\Delta_{rM}}(r\mathbf{1})$. $\mathbf{1}$ denotes the vector
with all components equal to one. The minimum radial displacement error is
also Rayleigh distributed. Note that a one-dimensional PDF is given by the
negative derivative of the reliability function.

The variance of the minimum radial displacement error is of interest. The
covariance matrix of the Rayleigh PDF in (3.58) is $C_{\Delta_r\Delta_r} = (2 - \frac{\pi}{2})\,C_{\Delta_x\Delta_x}$
and the variance of the switched radial displacement error is given by $\sigma_{\Delta_r^M}^2 =
(2 - \frac{\pi}{2})\,r^2$ [140]. In order to omit the constant factor, the variance of the
minimum radial displacement error is stated as a function of the covariance
matrix $C_{\Delta_r\Delta_r}$.




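$$\sigma_{\Delta_r^M}^2 = \frac{1}{\mathbf{1}^T\,C_{\Delta_r\Delta_r}^{-1}\,\mathbf{1}} \tag{3.63}$$

For example, the variances of the radial displacement errors might be identical
for all M hypotheses. (3.63) implies that switching of independent Rayleigh
distributed radial displacement errors reduces the variance by a factor of M.

$$\sigma_{\Delta_r^M}^2 = \frac{\sigma_{\Delta_r}^2}{M} \tag{3.64}$$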
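The factor-M reduction can be illustrated with a small Monte Carlo experiment. The sketch below draws M independent 2-D Gaussian displacement errors per trial, keeps the smallest radial error, and estimates its variance. The function name, the trial count, and the fixed seed are illustrative choices, not part of the original analysis.

```python
import math
import random


def min_radial_error_variance(m_hyp, sigma=1.0, trials=50_000, seed=0):
    """Estimate the variance of the minimum of M independent radial errors."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        # Radial error of each hypothesis: Euclidean norm of a 2-D Gaussian error.
        r_min = min(math.hypot(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
                    for _ in range(m_hyp))
        samples.append(r_min)
    mean = sum(samples) / trials
    return sum((r - mean) ** 2 for r in samples) / trials


if __name__ == "__main__":
    v1 = min_radial_error_variance(1)
    for m in (1, 2, 4, 8):
        vm = min_radial_error_variance(m)
        print(m, round(vm, 4), round(v1 / vm, 2))
```

For M = 1 the estimate should be close to the Rayleigh variance $(2 - \pi/2)\sigma^2$, and the printed ratio should be close to M.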



3.4.3 Equivalent Predictor 

Section 3.4.2 shows that both the individual radial displacement errors and
the minimum radial displacement error are Rayleigh distributed. This suggests
defining an equivalent motion-compensated predictor for switched prediction
that uses just one hypothesis but with a much smaller displacement error
variance. The equivalent distribution of the displacement error is assumed to
be separable and normal with zero mean and variance

Figure 3.17 gives an example for the equivalent predictor when switching
M independent and identically distributed displacement error signals. The 2-D
Gaussian distributed displacement error of each hypothesis is transformed
into the Rayleigh distributed radial displacement error. Minimizing M
independent radial displacement error signals yields a radial displacement error
which is also Rayleigh distributed. The inverse transform into the 2-D
Cartesian coordinate system determines the 2-D normally distributed displacement
error of the equivalent predictor. Its displacement error variance is reduced by
a factor of M in comparison to the variance of individual hypotheses.
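The equivalent predictor can also be checked in Cartesian coordinates. The following sketch selects, among M independent 2-D Gaussian displacement errors, the one with the smallest radius and estimates the variance of its x-component, which should be close to $\sigma_\Delta^2 / M$ under the independence assumption. The function name and simulation parameters are hypothetical.

```python
import math
import random


def switched_error_component_variance(m_hyp, sigma=1.0, trials=50_000, seed=1):
    """Estimate the x-component variance of the selected displacement error."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(trials):
        candidates = [(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
                      for _ in range(m_hyp)]
        # Forward-adaptive switching: keep the candidate closest to the origin.
        dx, _dy = min(candidates, key=lambda d: math.hypot(d[0], d[1]))
        acc += dx * dx   # zero mean by symmetry, so E[dx^2] is the variance
    return acc / trials


if __name__ == "__main__":
    for m in (1, 2, 4, 8):
        print(m, round(switched_error_component_variance(m), 4), round(1.0 / m, 4))
```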

The equivalent predictor suggests a model for switched prediction with just
one hypothesis but reduced displacement error variance. This reduced
variance determines the accuracy of motion-compensated prediction with
forward-adaptive hypothesis switching. That is, switching improves the accuracy of
motion-compensated prediction by reducing the displacement error variance.

3.4.4 Motion Compensation with Complementary Hypotheses and Forward-Adaptive Hypothesis Switching

We combine forward-adaptive hypothesis switching with complementary 
hypotheses motion-compensated prediction such that we superimpose N com- 
plementary hypotheses where each hypothesis is obtained by switching among 
M motion-compensated signals. 



Figure 3.18. Motion-compensated prediction with $N = 2$ complementary hypotheses and
forward-adaptive hypothesis switching $M = 2$.

Fig. 3.18 depicts a prediction scheme where the complementary hypotheses
$c_1$ and $c_2$ are determined by switching for each hypothesis among $M = 2$
motion-compensated signals. In practice, $N = 2$ complementary hypotheses
are selected from $M = 2$ reference frames. Further, as the complementary
hypotheses are jointly estimated, the frame selection is also performed jointly.

Let us consider the example that we switch among M motion-compensated
signals and utilize complementary hypotheses motion compensation with $N = 2$
hypotheses $c_1$ and $c_2$. Then, complementary hypotheses motion compensation
uses two sets of motion-compensated signals of size M. The hypothesis
$c_1$ is selected from the set $\{c_{11}, \ldots, c_{1M}\}$, and the complementary hypothesis
$c_2$ from the complementary set $\{c_{21}, \ldots, c_{2M}\}$. Choosing one
motion-compensated signal from each set provides two hypotheses whose displacement
error correlation coefficient is -1. But choosing two motion-compensated
signals from the same set provides two signals whose displacement error
correlation coefficient is 0. For this example, we assume that these hypotheses
exist and that an ideal superimposed motion estimator is able to determine the
desired signals.

According to the previous section, choosing among M motion-compensated
signals can reduce the displacement error variance by up to a factor of M.
Motion compensation with complementary hypotheses utilizes for each hypothesis
forward-adaptive switching. Consequently, the displacement error variance of
superimposed hypotheses $\sigma_\Delta^2$ in (3.9) is smaller by a factor of M.

Note that (3.66) represents a performance bound given the previous
assumptions.




Figure 3.19. Rate difference over the number of motion-compensated signals M for motion
compensation with complementary hypotheses and forward-adaptive hypothesis switching. The
switched hypotheses are just averaged and no residual noise is assumed. The results are for
half-pel accurate motion compensation, i.e., $\beta = -1$.

Fig. 3.19 depicts the rate difference over the size of the motion-compensated
signal set M according to (3.66). The performance bound of motion
compensation with complementary hypotheses and forward-adaptive hypothesis
switching for $N = 2$, 4, 8, and $\infty$ linearly combined hypotheses is compared
to motion-compensated prediction with forward-adaptive hypothesis switching
($N = 1$). Half-pel accurate motion compensation ($\beta = -1$) is assumed. We
observe that doubling the number of reference hypotheses decreases the
bit-rate for motion-compensated prediction by 0.5 bits per sample and for
motion-compensated prediction with complementary hypotheses by 1 bit per sample.
Due to the different slopes, the bit-rate savings by complementary hypotheses
motion compensation over single hypothesis motion compensation increase
with the number of available motion-compensated signals M. This theoretical
finding supports the experimental results in Sections 4.3.3 and 5.3.2.
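The figures of 0.5 and 1 bit per doubling of M can be reproduced with a back-of-the-envelope calculation. The sketch assumes that the displacement error variance scales as $\sigma_\Delta^2 \propto 2^{2\beta}$, so that dividing the variance by M lowers $\beta$ by $\frac{1}{2}\log_2 M$, and that the rate difference has reached its limiting slope of 1 bit per inaccuracy step for a single hypothesis and 2 bits for complementary hypotheses. Both are idealizations, and the function name is hypothetical.

```python
import math


def switching_rate_change(m_signals, slope_bits_per_step):
    """Idealized change of the rate difference when switching among M signals."""
    beta_shift = -0.5 * math.log2(m_signals)   # variance / M  <=>  beta - 0.5*log2(M)
    return slope_bits_per_step * beta_shift


if __name__ == "__main__":
    for m in (1, 2, 4, 8):
        single = switching_rate_change(m, 1)          # single hypothesis, N = 1
        complementary = switching_rate_change(m, 2)   # complementary hypotheses
        print(m, single, complementary, complementary - single)
```

The last printed column is the additional saving of complementary hypotheses over the single hypothesis case, which grows with M in this idealized setting.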

The experimental results also show a saturation of the gain by
forward-adaptive hypothesis switching. Choosing among M motion-compensated
signals can reduce the displacement error variance by up to a factor of M. This
lower bound is obtained when switching among M motion-compensated
signals with statistically independent displacement error. Correlated displacement
errors degrade the performance of hypothesis switching and cause a saturation
of the bit-rate savings for an increasing number of motion-compensated signals
M.

3.5 Pictures with Varying Number of Hypotheses 

In practice, video sequences are usually coded with two different picture 
types: P- and B-pictures. P-pictures use block-based (multiframe) motion- 
compensated prediction with temporally prior reference pictures whereas B- 
pictures utilize block-based bidirectional prediction with one (multiple) tem- 
porally subsequent and one (multiple) temporally prior reference picture(s). 
Picture type sequences with a varying number of inserted B-pictures like P, PB, 
PBB, and PBBB are widely used. Bidirectional prediction in B-pictures is a 
special case of superimposed motion-compensated prediction with N = 2 hy- 
potheses. If we neglect OBMC, motion-compensated prediction in P-pictures 
is predominantly single hypothesis prediction. 

In the following, we investigate the impact of motion compensation with
complementary hypotheses on the sequence coding performance based on the
previously discussed high-rate approximation for sequential encoding. At high
rates, the residual encoder guarantees infinitesimally small reconstruction
distortion and any picture coding order that exploits all statistical dependencies
provides the same coding efficiency. B-pictures with complementary
hypothesis motion compensation provide improved prediction performance over
P-pictures and their relative occurrence improves sequence coding performance.
Because of the high-rate approximation, we neglect the rate of the side
information.

The average rate difference for predictive sequence coding compared to
optimum intra-frame sequence coding yields

$$\Delta R = \frac{n_P}{n_P + n_B}\,\Delta R_P + \frac{n_B}{n_P + n_B}\,\Delta R_B, \tag{3.67}$$

where $n_P$ and $n_B$ are the number of P- and B-pictures in the sequence. $\Delta R_P$
denotes the rate difference of single hypothesis prediction and $\Delta R_B$ that of
motion compensation with complementary hypotheses. The relative number
of P- and B-pictures in a sequence determines the contribution of a particular
picture type to the average rate difference.




Figure 3.20. Performance bounds for five different picture type sequences. The rate difference 
is depicted over the displacement inaccuracy. B-pictures average two complementary hypothe- 
ses. No residual noise is assumed. 

Fig. 3.20 depicts the rate difference for predictive sequence coding compared
to optimum intra-frame sequence coding over the motion accuracy $\beta$.
No residual noise is assumed. The rate difference is given for five different 
picture type sequences: P, PB, PBB, and PBBB with 0, 6, 8, and 9 B-pictures 
in a group of 12 pictures where the B-pictures utilize motion-compensated 
prediction with two complementary hypotheses. S denotes a sequence of su- 
perimposed pictures which also use motion-compensated prediction with two 
complementary hypotheses. As the slope of the rate difference is not equal 
for P- and B-pictures, the linear combination in (3.67) affects the slope of the 
average rate difference. In the limit, the picture type sequences PB, PBB, and 
PBBB modify the slope of the average rate difference to 3/2, 5/3, and 7/4 bits 
per sample per inaccuracy step, respectively. 
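
The limiting slopes can be checked directly from the weighting in (3.67). The
following Python sketch assumes the noise-free slopes of 1 bit per sample per
inaccuracy step for P-pictures and 2 for B-pictures with complementary
hypotheses; the function name average_slope is illustrative and not part of
the original text.

```python
from fractions import Fraction

def average_slope(n_p, n_b, slope_p=1, slope_b=2):
    """Slope of the average rate difference, weighted as in (3.67)."""
    total = n_p + n_b
    return Fraction(n_p, total) * slope_p + Fraction(n_b, total) * slope_b

# Number of B-pictures in a group of 12 pictures, as used for Fig. 3.20.
for name, n_b in [("P", 0), ("PB", 6), ("PBB", 8), ("PBBB", 9)]:
    print(name, average_slope(12 - n_b, n_b))
```

The exact fractions make it easy to see that only the ratio of B-pictures to
all pictures matters, not the group length.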
Figure 3.21. Performance bounds for five different picture type sequences. The rate difference
is depicted over the displacement inaccuracy. B-pictures average two complementary hypothe-
ses. The residual noise is -30 dB.

Similar to Fig. 3.20, Fig. 3.21 depicts the rate difference for predictive
sequence coding compared to optimum intra-frame sequence coding over the
motion accuracy, but with a residual noise of RNL = -30 dB. In the presence
of residual noise, the advantage due to the slope difference is mainly prevalent
in the range between quarter-pel and integer-pel accuracy. Beyond quarter-pel
accuracy, the residual noise dominates and the average rate difference starts to
saturate.

3.6 Conclusions 

This chapter extends the theory of multihypothesis motion-compensated
prediction by introducing the concept of motion compensation with complementary
hypotheses. We allow for the displacement errors of N hypotheses to
be correlated. The assumption that the N displacement errors are jointly
distributed imposes a constraint on the displacement error correlation coefficient
as any covariance matrix is nonnegative definite. We analyze the dependency
between the displacement error correlation coefficient and the performance of 
superimposed motion compensation. An ideal superimposed motion estimator
minimizes the prediction error and, consequently, the displacement error
correlation coefficient. The optimal ideal superimposed motion estimator is the
estimator that determines N complementary hypotheses with maximally
negatively correlated displacement error. This is a result which is independent of
the underlying frame signal model. 
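
The constraint on the correlation coefficient can be made explicit under a
simplifying assumption that is introduced here for illustration only: all N
displacement errors share the same variance $\sigma_\Delta^2$ and every pair
has the same correlation coefficient $\rho_\Delta$ (both symbols are assumed
notation). The covariance matrix and its eigenvalues then read:

```latex
\begin{align}
C &= \sigma_\Delta^2 \left[ (1-\rho_\Delta)\, I + \rho_\Delta\, \mathbf{1}\mathbf{1}^T \right], \\
\lambda_1 &= \sigma_\Delta^2 \left[ 1 + (N-1)\rho_\Delta \right], \qquad
\lambda_2 = \dots = \lambda_N = \sigma_\Delta^2 (1-\rho_\Delta), \\
C \succeq 0 \;&\Longleftrightarrow\; \frac{1}{1-N} \le \rho_\Delta \le 1 .
\end{align}
```

Under this assumption, the most negative admissible correlation is
$\rho_\Delta = 1/(1-N)$, that is, $-1$ for N = 2, which is the sense in which
complementary hypotheses are maximally negatively correlated.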

Assuming band-limited and noise-free frame signals, the motion compen- 
sator with complementary hypotheses achieves a rate difference of up to 2 bits 
per sample per inaccuracy step whereas a single hypothesis motion compen- 
sator is limited to 1 bit per sample per inaccuracy step. When averaging more 
than two hypotheses, the bit-rate savings are limited even if the number of 
hypotheses grows very large. When utilizing the optimum Wiener filter, the 
bit-rate savings are not limited and doubling the number of hypotheses im- 
proves the rate difference by 0.5 bits per sample for a very large number of 
hypotheses. 

In the presence of residual noise, the bit-rate savings saturate for increasing
motion accuracy. The optimum Wiener filter with an infinite number of hy- 
potheses permits bit-rate savings that are independent of the motion accuracy 
and limited only by the residual noise. In addition, the Wiener filter for motion 
compensation with complementary hypotheses amplifies high frequency com- 
ponents and shows band-pass characteristics. The optimum Wiener filter for 
single hypothesis motion compensation shows only low-pass characteristics. 

This chapter also combines forward-adaptive hypothesis switching with 
complementary hypothesis motion compensation. Choosing among M mo- 
tion-compensated signals with statistically independent displacement error re- 
duces the displacement error variance by up to a factor of M. We utilize mo- 
tion compensation with complementary hypotheses such that each hypothesis 
is determined by switching among M reference hypotheses. An analysis of 
the noise-free case shows that doubling the number of reference hypotheses 
decreases the bit-rate of motion-compensated prediction by 0.5 bits per sam- 
ple and that of motion compensation with complementary hypotheses by 1 bit 
per sample. Due to the different slopes, the bit-rate savings by complementary 
hypotheses motion compensation over single hypothesis motion compensation 
increase with the number of available reference signals M. 
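
The two slopes quoted in this paragraph can be checked with a short
calculation. It assumes, as a labeled assumption rather than a quotation, that
the displacement inaccuracy is measured on a logarithmic scale,
$\beta = \log_2(\sqrt{12}\,\sigma_\Delta)$, that switching achieves the full
variance reduction by the factor M, and that the noise-free slopes are 1 and 2
bits per sample per inaccuracy step:

```latex
\begin{align}
\sigma_\Delta^2 \to \frac{\sigma_\Delta^2}{M}
  \quad&\Longrightarrow\quad \beta \to \beta - \tfrac{1}{2}\log_2 M, \\
\text{single hypothesis (slope 1):}\quad \Delta R &\to \Delta R - \tfrac{1}{2}\log_2 M, \\
\text{complementary hypotheses (slope 2):}\quad \Delta R &\to \Delta R - \log_2 M .
\end{align}
```

For M = 2 this gives 0.5 and 1 bit per sample, respectively, so the gap between
the two schemes widens by 0.5 bits per sample with each doubling of M.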

Finally, this chapter discusses sequence coding with different picture types 
characterized by a varying number of hypotheses. As the slope of the rate 
difference is not equal for single hypothesis prediction (P-pictures) and super- 
imposed prediction with complementary hypotheses (B-pictures), particular 
picture type sequences influence the slope of the overall average rate differ- 
ence. The analysis suggests that the number of pictures with complementary 
hypotheses motion compensation should be increased such that the overall av- 
erage rate difference benefits from complementary hypotheses. 



Chapter 4



ITU-T REC. H.263 AND
SUPERIMPOSED PREDICTION 



4.1 Introduction 

This chapter discusses a practical implementation of superimposed motion- 
compensated prediction according to Section 2.2.5, based on the ITU-T Rec- 
ommendation H.263 [63, 14]. The general concept of superimposed prediction 
with multiple reference frames is not part of this standard. We investigate the 
efficient number of hypotheses, the combination with variable block size pre- 
diction, and the influence of multiframe motion compensation. Further, we 
relate the experimental results to the insights from Chapter 3. 

ITU-T Recommendation H.263 utilizes a hybrid video coding concept with 
block-based motion-compensated prediction and DCT-based transform cod- 
ing of the prediction error. P-frame coding of H.263 employs INTRA and 
INTER coding modes. Superimposed motion-compensated prediction for P- 
frame coding is enabled by new coding modes that are derived from H.263 
INTER coding modes. Annex U of ITU-T Rec. H.263 allows multiframe 
motion-compensated prediction but does not provide capabilities for superimposed
prediction. A combination of H.263 Annex U with B-frames leads to the
concept of superimposed multiframe prediction. In this chapter, we do not use
H.263 B-frames as we discuss interpolative prediction for in-order encoding
of sequences. H.263 B-frames can only be used for out-of-order encoding of
sequences. Further, the presented concept of superimposed multiframe predic- 
tion is much more general than the B-frames in H.263. ITU-T Rec. H.263 also 
provides OBMC capability. As discussed previously, OBMC uses more than 
one motion vector for predicting the same pixel but those motion vectors are 
also used by neighboring blocks. In this chapter, a block predicted by super- 
imposed motion compensation has its individual set of motion vectors. We do 
not overlap shifted blocks that might be obtained by utilizing spatially neigh-
boring motion vectors. The INTER4V coding mode of H.263 utilizes VBS 
prediction with either OBMC or an in-loop deblocking filter. Superimposed 
prediction already filters the motion-compensated signals and an extension of 
H.263 OBMC or in-loop deblocking filter is not implemented. 

The outline of this chapter is as follows: Section 4.2 explains the video 
codec with superimposed motion-compensated prediction. We outline syntax 
extensions for H.263 as well as coder control issues for mode decision and mo- 
tion estimation. Section 4.3 discusses several experiments. We investigate the 
efficient number of hypotheses, combine superimposed prediction with blocks 
of variable size, and study the influence of multiframe motion compensation. 

4.2 Video Coding with Superimposed Motion 

ITU-T Recommendation H.263 standardizes a block-based hybrid video 
codec. Such a codec utilizes motion-compensated prediction to generate a 
prediction signal from previous reconstructed frames in order to reduce the 
bit-rate of the residual encoder. For block-based MCP, one motion vector and 
one picture reference parameter which address the reference block in a previ- 
ous reconstructed frame are assigned to each block in the current frame. 

The superposition video codec [207] additionally reduces the bit-rate of
the residual encoder by improving the prediction signal. The improvement
is achieved by linearly combining more than one motion-compensated prediction
signal. For block-based superimposed MCP, more than one motion vector
and picture reference parameter, which address reference blocks in previous
reconstructed frames, are assigned to each block in the current frame. These
multiple reference blocks are linearly combined to form the block-based
superimposed prediction signal.
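
As a minimal illustration of this linear combination, the following Python
sketch superimposes already motion-compensated reference blocks. It assumes
equal weights (plain averaging) unless weights are given; the function name
superimpose and the sample values are hypothetical.

```python
def superimpose(blocks, weights=None):
    """Linearly combine equally sized reference blocks (lists of rows)."""
    n = len(blocks)
    if weights is None:
        weights = [1.0 / n] * n  # assumption: plain averaging
    rows, cols = len(blocks[0]), len(blocks[0][0])
    return [[sum(w * b[r][c] for w, b in zip(weights, blocks))
             for c in range(cols)]
            for r in range(rows)]

# Two hypothetical 2x2 reference blocks addressed by two motion vectors.
h1 = [[100, 104], [96, 100]]
h2 = [[104, 100], [100, 96]]
print(superimpose([h1, h2]))  # [[102.0, 102.0], [98.0, 98.0]]
```

In this toy case the deviations of the two hypotheses cancel, which is the
effect that superimposed prediction tries to exploit.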

The coding efficiency is improved at the expense of increased computa- 
tional complexity for motion estimation at the encoder. But this disadvantage 
can be tackled by efficient estimation strategies like successive elimination as 
discussed in Section 2.3.4. At the decoder, a minor complexity increase is 
caused by the selection and combination of multiple prediction signals. Please 
note that not all macroblocks utilize superimposed MCP. 

4.2.1 Syntax Extensions 

The syntax of H.263 is extended such that superimposed motion compen- 
sation is possible. On the macroblock level, two new modes, INTER2H and 
INTER4H, are added which allow two or four hypotheses per macroblock, 
respectively. These modes are similar to the INTER mode of H.263. The
INTER2H mode additionally includes an extra motion vector and frame reference
parameter for the second hypothesis. The INTER4H mode incorporates three 
extra motion vectors and frame reference parameters. For variable block size 
prediction, the INTER4V mode of H.263 is extended by a superposition block 
pattern. This pattern indicates for each 8x8 block the number of motion vec- 
tors and frame reference parameters. This mode is called INTER4VMH. The 
superposition block pattern has the advantage that the number of hypotheses 
can be indicated individually for each 8x8 block. This allows the impor- 
tant case that just one 8x8 block can be coded with more than one motion 
vector and frame reference parameter. The INTER4VMH mode includes the 
INTER4V mode when the superposition block pattern indicates just one hy- 
pothesis for all 8 x 8 blocks. 
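
The idea of a per-block pattern can be sketched as follows. The bit layout is
purely hypothetical and is not the syntax of the extended codec: it assumes at
most two hypotheses per 8x8 block, so that one bit per block suffices.

```python
def pack_pattern(hypotheses_per_block):
    """Pack the hypothesis count (1 or 2) of four 8x8 blocks into 4 bits."""
    if len(hypotheses_per_block) != 4:
        raise ValueError("a macroblock has four 8x8 blocks")
    pattern = 0
    for index, count in enumerate(hypotheses_per_block):
        if count not in (1, 2):
            raise ValueError("only one or two hypotheses are supported")
        pattern |= (count - 1) << index  # bit set: two hypotheses
    return pattern

def unpack_pattern(pattern):
    """Recover the hypothesis count of each 8x8 block from the pattern."""
    return [((pattern >> index) & 1) + 1 for index in range(4)]

print(pack_pattern([1, 1, 2, 1]))  # 4
print(unpack_pattern(0))           # [1, 1, 1, 1]
```

A pattern of zero corresponds to one hypothesis for every 8x8 block, which is
the case in which the extended mode coincides with INTER4V.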

4.2.2 Coder Control 

The coder control for the superposition video codec utilizes rate-distortion 
optimization by Lagrangian methods. For that, the average Lagrangian cost of 
a macroblock, given the previously encoded macroblocks, is minimized. The 
average cost $J = D + \lambda R$ consists of the average distortion $D$ and the
weighted average bit-rate $R$. The weight, also called Lagrange multiplier
$\lambda$, is related to the macroblock quantization parameter $QP$ by the
relationship

$$\lambda = 0.85 \, QP^2 \qquad (4.1)$$

as discussed in Section 2.3.3. This generic optimization method provides 
the encoding strategy for the superposition encoder: Minimizing the
instantaneous Lagrangian costs for each macroblock minimizes the average
Lagrangian costs, given the previously encoded macroblocks.

H.263 allows several encoding modes for each macroblock. The one with 
the lowest Lagrangian costs is selected for the encoding. This strategy is also 
called rate-constrained mode decision [208], [121]. 
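
A minimal sketch of rate-constrained mode decision is given below. It uses the
cost J = D + lambda * R with lambda = 0.85 * QP^2 from (4.1); the function
names and all distortion and rate numbers are hypothetical and serve only to
illustrate the selection rule.

```python
def lagrange_multiplier(qp):
    """Lagrange multiplier lambda = 0.85 * QP^2, as in (4.1)."""
    return 0.85 * qp ** 2

def select_mode(candidates, qp):
    """Return the mode with the lowest Lagrangian cost J = D + lambda * R.

    candidates maps a mode name to (distortion, rate_in_bits).
    """
    lam = lagrange_multiplier(qp)
    costs = {mode: d + lam * r for mode, (d, r) in candidates.items()}
    best = min(costs, key=costs.get)
    return best, costs

# Hypothetical numbers for one macroblock (summed squared error, bits).
candidates = {
    "INTER": (5200.0, 40),
    "INTER2H": (3900.0, 52),
    "INTRA": (3000.0, 300),
}
print(select_mode(candidates, qp=10)[0])  # INTER2H
print(select_mode(candidates, qp=25)[0])  # INTER
```

With these illustrative numbers, the extra side information of the second
hypothesis pays off at the finer quantizer but not at the coarser one.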

The new superposition modes include both superimposed prediction and 
prediction error encoding. The Lagrangian costs of the new superposition 
modes have to be evaluated for rate-constrained mode decision. The distor- 
tion of the reconstructed macroblock is determined by the summed squared 
error. The macroblock bit-rate also includes the rate of all motion vectors and
picture reference parameters. This allows the best trade-off between superim- 
posed MCP rate and prediction error rate [122]. 

As already mentioned, superimposed MCP improves the prediction signal 
by spending more bits for the side-information associated with the motion- 
compensating predictor. But the encoding of the prediction error and its asso-
ciated bit-rate also determines the quality of the reconstructed block. A joint 
optimization of superimposed motion estimation and prediction error encod- 
ing is far too demanding. But superimposed motion estimation independent of 
prediction error encoding is an efficient and practical solution. This solution 
is efficient if rate-constrained superimposed motion estimation, as explained 
before, is applied. 

For example, the encoding strategies for the INTER and INTER2H modes
are as follows: Testing the INTER mode, the encoder successively performs
rate-constrained motion estimation for integer-pel positions and rate-
constrained half-pel refinement. Rate-constrained motion estimation incorpo-
rates the prediction error of the video signal as well as the bit-rate for the 
motion vector and picture reference parameter. Testing the INTER2H mode, 
the encoder performs rate-constrained superimposed motion estimation. Rate- 
constrained superimposed motion estimation incorporates the superimposed 
prediction error of the video signal as well as the bit-rate for two motion vec- 
tors and picture reference parameters. Rate-constrained superimposed motion 
estimation is performed by the HSA in Fig. 2.6 which utilizes in each itera- 
tion step rate-constrained motion estimation to determine a conditional rate- 
constrained motion estimate. Given the obtained motion vectors and picture 
reference parameters for the INTER and INTER2H modes, the resulting pre- 
diction errors are encoded to evaluate the mode costs. The encoding strategy 
for the INTER4H mode is similar. For the INTER4VMH mode, the number of 
hypotheses for each 8x8 block is determined after encoding its residual error. 
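
The iterative search for two hypotheses can be sketched as follows. This is
only one plausible reading of the description above, not a reproduction of
Fig. 2.6: it assumes that the two hypotheses are averaged, that each iteration
step re-estimates one hypothesis while the other is held fixed, and that the
search stops when the Lagrangian cost no longer decreases. The candidate
blocks, rates, and the function name are hypothetical.

```python
def hsa_two_hypotheses(target, candidates, rates, lam, max_iter=10):
    """Iteratively refine two hypotheses whose average predicts target.

    candidates: candidate reference blocks (flat lists of samples)
    rates: bits for signalling each candidate (hypothetical side information)
    """
    def cost(i, j):
        pred = [(a + b) / 2 for a, b in zip(candidates[i], candidates[j])]
        ssd = sum((t - p) ** 2 for t, p in zip(target, pred))
        return ssd + lam * (rates[i] + rates[j])

    indices = range(len(candidates))
    # Start from the best candidate used for both hypotheses.
    first = min(indices, key=lambda i: cost(i, i))
    pair = [first, first]
    best = cost(first, first)
    for _ in range(max_iter):
        improved = False
        for k in (0, 1):  # conditional estimate: vary one, fix the other
            other = pair[1 - k]
            cand = min(indices, key=lambda i: cost(i, other))
            c = cost(cand, other)
            if c < best:
                pair[k], best, improved = cand, c, True
        if not improved:
            break
    return tuple(pair), best

target = [10, 20, 30, 40]
blocks = [[11, 21, 31, 41], [9, 19, 29, 37], [13, 20, 30, 40]]
print(hsa_two_hypotheses(target, blocks, rates=[2, 2, 2], lam=0.5))
```

Such an alternating search only guarantees a local optimum, since each step
can never increase the cost but also never explores pairs jointly.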

4.3 Experimental Results 

The superposition codec is based on the ITU-T Rec. H.263 with unrestricted 
motion vector mode, four motion vectors per macroblock, and enhanced refer- 
ence picture selection in sliding window buffering mode. In contrast to H.263, 
the superposition codec uses a joint entropy code for horizontal and vertical 
motion vector data as well as an entropy code for the picture reference
parameter. The efficiency of the reference codec is comparable to that of the H.263
test model TMN-10 [209]. The test sequences are coded at QCIF resolution
and 10 fps. Each sequence has a length of ten seconds. For comparison purposes,
the PSNR values of the luminance component are measured and plotted
over the total bit-rate for the quantization parameters 4, 5, 7, 10, 15, and 25.
The data of the first intra-frame coded picture, which is identical in all cases, 
is excluded from the results. 







4.3.1 Multiple Hypotheses for Constant Block Size 

We investigate the coding efficiency of superimposed prediction with two
and four hypotheses for constant block size. Figs. 4.1 and 4.2 depict the average
luminance PSNR from reconstructed frames over the overall bit-rate for
the sequences Foreman and Mobile & Calendar, respectively. The performance
of the codec with baseline prediction (BL), superimposed prediction
with two hypotheses (BL + INTER2H), and four hypotheses (BL + INTER2H
+ INTER4H) is shown. In each case, M = 10 reference pictures are utilized
for prediction. The baseline performance for single frame prediction (M = 1)
is added for reference.




Figure 4.1. Average luminance PSNR over total rate for the sequence Foreman depicting the 
performance of the superposition coding scheme for constant block size. M = 10 reference 
pictures are utilized for prediction. 

Superimposed prediction is enabled by allowing the INTER2H mode on the
macroblock level. A gain of up to 1 dB for the sequence Foreman and 1.4
dB for the sequence Mobile & Calendar is achieved by the INTER2H mode.
Superimposed prediction with up to four hypotheses is implemented such that
each predictor type (depending on the number of superimposed signals)
constitutes a coding mode. A rate-distortion efficient codec should utilize four
hypotheses only when their coding gain is justified by the associated bit-rate.
In the case that four hypotheses are not efficient, the codec should be able to
select two hypotheses and choose the INTER2H mode. The additional
INTER4H mode gains just up to 0.1 dB for the sequence Foreman and 0.3 dB for
the sequence Mobile & Calendar. These results support the finding in Section
3.2.2 that two hypotheses provide the largest relative gain. Considering both
this insight and the computational complexity for estimating four hypotheses,
we will restrict the superposition coding scheme to two hypotheses.

Figure 4.2. Average luminance PSNR over total rate for the sequence Mobile & Calendar
depicting the performance of the superposition coding scheme for constant block size. M = 10
reference pictures are utilized for prediction.

Finally, we consider hypotheses that are not optimized with the hypothesis
selection algorithm in Fig. 2.6. With only previous reference frames,
non-optimized hypotheses are as good as or worse (due to the bit-rate of additional
motion vectors) than single hypothesis prediction. In that case, the mode
selection prefers the single hypothesis mode and the rate-distortion performance
is identical to that with the label “M = 10, BL”.

4.3.2 Multiple Hypotheses for Variable Block Size 

In this subsection, we investigate the influence of variable block size (VBS) 
prediction on superimposed prediction for M = 10 reference pictures. VBS 
prediction in H.263 is enabled by the INTER4V mode which utilizes four mo- 
tion vectors per macroblock. Both VBS prediction and superimposed predic- 
tion use more than one motion vector per macroblock which is transmitted to 
the decoder as side-information. But both concepts provide gains for different 
scenarios. This can be verified by applying superimposed prediction to blocks 
of size 16 x 16 (INTER2H) as well as 8 x 8 (INTER4VMH). As we permit a 
maximum of two hypotheses per block, one bit is sufficient to signal whether
one or two prediction signals are used. 




Figure 4.3. Average luminance PSNR over total rate for the sequence Foreman. Superimposed 
and variable block size prediction can be successfully combined for compression. M = 10
reference pictures are utilized for prediction. 




Figure 4.4. Average luminance PSNR over total rate for the sequence Mobile & Calendar. 
Superimposed and variable block size prediction can be successfully combined for compression. 
M = 10 reference pictures are utilized for prediction. 
Figs. 4.3 and 4.4 depict the average luminance PSNR from reconstructed
frames over the overall bit-rate for the sequences Foreman and Mobile &
Calendar. The performance of the codec with baseline prediction (BL), VBS
prediction (BL + VBS), superimposed prediction with two hypotheses (BL +
INTER2H), and superimposed prediction with variable block size (BL + VBS +
MHP(2)) is shown. In each case, M = 10 reference pictures are utilized for
prediction. The baseline performance for single frame prediction (M = 1) is
added for reference.

The combination of superimposed and variable block size prediction yields
superior compression efficiency. For example, to achieve a reconstruction
quality of 35 dB in PSNR, the sequence Mobile & Calendar is coded in baseline
mode with 403 kbit/s for M = 10 (see Fig. 4.4). Correspondingly, superimposed
prediction with M = 10 reduces the bit-rate to 334 kbit/s. Superimposed
prediction on macroblocks thus decreases the bit-rate by 17%. Performing
superimposed prediction additionally on 8 x 8 blocks, the bit-rate is 290 kbit/s
in contrast to 358 kbit/s for the codec with VBS. Superimposed prediction
decreases the bit-rate by 19% relative to our codec with VBS prediction. Similar
observations can be made for the sequence Foreman at 120 kbit/s. Superimposed
prediction on macroblocks gains about 1 dB over baseline prediction for
M = 10 (see Fig. 4.3). Performing superimposed prediction additionally on
8 x 8 blocks, the gain is about 0.9 dB compared to the codec with VBS and
M = 10 reference pictures.
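
The quoted percentages follow from the bit-rates given for Mobile & Calendar at
35 dB PSNR. The short Python check below uses only those four rates; the helper
name saving_percent is illustrative.

```python
def saving_percent(reference_rate, rate):
    """Relative bit-rate reduction in percent."""
    return 100.0 * (reference_rate - rate) / reference_rate

# Rates in kbit/s at 35 dB PSNR, M = 10, as quoted in the text.
print(round(saving_percent(403, 334)))  # baseline vs. INTER2H: 17
print(round(saving_percent(358, 290)))  # VBS vs. VBS with two hypotheses: 19
```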

Please note that the coding efficiency for the sequences Foreman (Fig. 4.3) 
and Mobile & Calendar (Fig. 4.4) is comparable for VBS prediction (BL + 
VBS) and superimposed prediction with two hypotheses (BL + INTER2H) 
over the range of bit-rates considered. Superimposed prediction utilizes just 
two motion vectors and picture reference parameters compared to four for the 
INTER4V mode. 

For variable block size prediction, four hypotheses also provide no significant
improvement over two hypotheses. For example, the superposition codec
with VBS and four hypotheses achieves just up to 0.3 dB gain over the codec
with two hypotheses for the sequence Mobile & Calendar.

In summary, superimposed prediction works efficiently for both 16x16 and 
8x8 blocks. The savings due to superimposed prediction are observed in the 
baseline mode as well as in the VBS prediction mode. Hence, the hypothesis 
selection algorithm in Fig. 2.6 is able to find two prediction signals on M = 10 
reference frames which are combined more efficiently than just one prediction 
signal from these reference frames. 
4.3.3 Multiple Hypotheses and Multiple Reference Pictures

The results presented so far are obtained for superimposed motion-compensated
prediction with M = 10 reference pictures in sliding window buffering
mode. In this section, the influence of multiple reference frames on the
superposition codec is investigated [210]. It is demonstrated that two hypotheses
chosen only from the prior decoded frame, i.e., M = 1, also improve coding
efficiency. Additionally, the use of multiple reference frames enhances the
efficiency of the superposition codec.

Figs. 4.5 and 4.6 show the bit-rate savings at 35 dB PSNR of the decoded
luminance signal over the number of reference frames M for the sequences
Foreman and Mobile & Calendar, respectively. We compute PSNR vs. bit-rate
curves by varying the quantization parameter and interpolate intermediate
points by a cubic spline. The performance of the codec with variable block size
prediction (VBS) is compared to the superposition codec with two hypotheses
(VBS + MHP(2)). Results are depicted for M = 1, 2, 5, 10, and 20.

Figs. 4.7 and 4.8 show the same experiment as Figs. 4.5 and 4.6 but depict 
the absolute bit-rate over the number of reference pictures M for the sequences 
Foreman and Mobile & Calendar, respectively. The relative bit-rate savings 
with two hypotheses are given. 

The superposition codec with M = 1 reference frame has to choose both
prediction signals from the previous decoded frame. The superposition codec
with VBS saves 7% for the sequence Foreman and 9% for the sequence Mobile
& Calendar when compared to the VBS codec with one reference frame. For
M > 1, more than one reference frame is allowed for each prediction signal.
The reference frames for both hypotheses are selected by the rate-constrained
superimposed motion estimation algorithm. The picture reference parameter
also allows the special case that both hypotheses are chosen from the same
reference frame. The rate constraint is responsible for the trade-off between
prediction quality and bit-rate. For M = 20 reference frames, the superposition
codec with VBS saves 25% for the sequence Foreman and 31% for the
sequence Mobile & Calendar when compared to the VBS codec with one reference
frame. For the same number of reference frames, the VBS codec saves
about 15% for both sequences. The superposition codec with VBS benefits
when being combined with multiframe prediction so that the savings are more
than additive. The bit-rate savings saturate for 20 reference frames for both
sequences.

In Section 3.4.4 we model multiframe motion compensation by forward-
adaptive hypothesis switching. When hypothesis switching is combined with
complementary hypotheses motion compensation, we observe that the bit-rate
savings by complementary hypotheses motion compensation over single
hypothesis motion compensation increase with the number of motion-compensated
signals M. For that, N = 2 complementary hypotheses are sufficient. This
theoretical result is consistent with the previous experimental results.

Figure 4.5. Bit-rate savings at 35 dB PSNR over the number of reference pictures M for the
sequence Foreman. For variable block sizes, the performance of the superposition codec is
compared to the reference codec.

Figure 4.6. Bit-rate savings at 35 dB PSNR over the number of reference pictures M for the
sequence Mobile & Calendar. For variable block sizes, the performance of the superposition
codec is compared to the reference codec.

Figure 4.7. Absolute bit-rate at 35 dB PSNR over the number of reference pictures M for
the sequence Foreman. For variable block sizes, the performance of the superposition codec is
compared to the reference codec.

Figure 4.8. Absolute bit-rate at 35 dB PSNR over the number of reference pictures M for the
sequence Mobile & Calendar. For variable block sizes, the performance of the superposition
codec is compared to the reference codec.
Figure 4.9. Average luminance PSNR over total rate for the sequence Foreman. The
performance of the superposition codec with variable block size is depicted for M = 1 and M = 20
reference frames.







Figure 4.10. Average luminance PSNR over total rate for the sequence Mobile & Calendar. 
The performance of the superposition codec with variable block size is depicted for M = 1 and 
M = 20 reference frames. 
Figure 4.11. Average luminance PSNR over total rate for the sequence Foreman. The perfor-
mance of the superposition codec with variable block size and multiframe motion compensation 
is compared to the reference codec. 




Figure 4.12. Average luminance PSNR over total rate for the sequence Mobile & Calendar.
The performance of the superposition codec with variable block size and multiframe motion 
compensation is compared to the reference codec. 
Figure 4.13. Average luminance PSNR over total rate for the sequence Sean. The performance
of the superposition codec with variable block size and multiframe motion compensation is 
compared to the reference codec. 




Figure 4.14. Average luminance PSNR over total rate for the sequence Weather. The perfor-
mance of the superposition codec with variable block size and multiframe motion compensation 
is compared to the reference codec. 
Figs. 4.9 and 4.10 depict the average luminance PSNR over the total bit-rate
for the sequences Foreman and Mobile & Calendar. The superposition codec
with variable block size (VBS + MHP(2)) is compared to the variable block
size codec (VBS) for M = 1 and M = 20 reference frames. We can observe
in these figures that superimposed prediction in combination with multiframe
motion compensation achieves coding gains up to 1.8 dB for Foreman and 2.8
dB for Mobile & Calendar. It is also observed that the use of multiple reference
frames enhances the efficiency of superimposed motion-compensated
prediction for video compression.

Finally, Figs. 4.5 and 4.6 suggest that a frame memory of M = 10 provides 
a good trade-off between encoder complexity and compression efficiency for 
our superposition codec. Therefore, Figs. 4.11, 4.12, 4.13, and 4.14 compare 
the superposition codec with variable block size and frame memory M = 10 
to the reference codec with frame memory M = 1 and M = 10 for the sequences
Foreman, Mobile & Calendar, Sean, and Weather, respectively. For
each sequence the average luminance PSNR is depicted over the total bit-rate. 
The superposition codec with multiframe motion compensation achieves cod- 
ing gains up to 1.8 dB for Foreman, 2.7 dB for Mobile & Calendar, 1.6 dB 
for Sean, and 1.5 dB for Weather compared to the reference codec with frame 
memory M = 1. The gain by multiframe prediction and superimposed predic- 
tion is comparable for the presented sequences. 

4.4 Conclusions 

In our experiments, we observe that variable block size and superimposed 
prediction can be combined successfully. Superimposed prediction works effi- 
ciently for both 16 x 16 and 8x8 blocks. Multiframe motion compensation 
enhances the efficiency of superimposed prediction. The superposition gain 
and the multiframe gain do not only add up; superimposed prediction benefits 
from hypotheses that can be chosen from different reference frames. Superim- 
posed motion-compensated prediction with two hypotheses and ten reference 
frames achieves coding gains up to 2.7 dB, or equivalently, bit-rate savings 
up to 30% for the sequence Mobile & Calendar when compared to the refer- 
ence codec with one reference frame. Therefore, superimposed prediction with 
multiframe and variable block size motion compensation is very efficient and 
practical for video compression. 




Chapter 5 



ITU-T REC. H.264 AND 
GENERALIZED B PICTURES 



5.1 Introduction 



This chapter discusses B -pictures in the context of the draft H.264 video 
compression standard. B -pictures are pictures in a motion video sequence that 
are encoded using both past and future pictures as references. The predic- 
tion is obtained by a linear combination of forward and backward prediction 
signals usually obtained with motion compensation. However, such a super- 
position is not necessarily limited to forward and backward prediction sig- 
nals as investigated in Chapter 4. For example, a linear combination of two 
forward prediction signals can also be efficient in terms of compression effi- 
ciency. The prediction method which linearly combines motion-compensated 
signals regardless of the reference picture selection will be referred to as su- 
perimposed motion-compensated prediction. The concept of reference picture 
selection [91], also called multiple reference picture prediction, is utilized to 
allow prediction from both temporal directions. In this particular case, a bidi- 
rectional picture reference parameter addresses both past and future reference 
pictures [211]. This generalization in terms of picture reference selection and 
linearly combined prediction signals is reflected in the term generalized
B-pictures and is part of the emerging H.264 video compression standard [212].
It is desirable that an arbitrary pair of reference pictures can be signaled to
the decoder [213, 214]. This includes the classical combination of forward 
and backward prediction signals but also allows forward/forward as well as 
backward/backward pairs. When combining the two most recent pictures, a
functionality similar to the dual-prime mode in MPEG-2 [13, 97] is achieved,
where top and bottom fields are averaged to form the final prediction.

B-pictures in H.264 have been improved in several ways compared to B-pictures
in MPEG-2 [13] and H.263 [14]. The block size for motion compensation
can range from 16 x 16 to 4 x 4 pixels and the direct mode with weighted
blending allows not only a scaling of the motion vectors but also a weighting 
of the prediction signal. The ongoing H.264 development will also provide 
improved H.263 Annex U functionality. H.263 Annex U, Enhanced Reference 
Picture Selection, already allows multiple reference pictures for forward pre- 
diction and two-picture backward prediction in B-pictures. When choosing 
between the most recent and the subsequent reference picture, the multiple ref- 
erence picture selection capability is very limited. Utilizing multiple prior and 
subsequent reference pictures improves the compression efficiency of H.263 
B-pictures. 

The H.264 test model software TML-9 uses only inter pictures as reference 
pictures to predict the B-pictures. Beyond the test model software TML-9, 
and different from past standards, the multiple reference picture framework 
in H.264 also allows previously decoded B-pictures to be used as reference 
to improve prediction efficiency [215]. B-pictures can be utilized to establish 
an enhancement layer in a layered representation and allow temporal scala- 
bility [216]. That is, decoding of a sequence at more than one frame rate is 
achievable. In addition to this functionality, B-pictures generally improve the 
overall compression efficiency as compared to that of inter pictures only [217]. 
On the other hand, they increase the time delay due to multiple future refer- 
ence pictures. But this disadvantage is not critical in applications like Internet 
streaming and multimedia storage for entertainment purposes. 

The outline of this chapter is as follows: Section 5.2 introduces B-picture 
prediction modes. After an overview, direct and superposition mode are dis- 
cussed in more detail and a rate-distortion performance comparison of three 
mode classes is provided. Section 5.3 elaborates on superimposed prediction. 
The difference between bidirectional and superposition mode is outlined and 
quantified in experimental results. In addition, the efficiency of two combined 
forward prediction signals is also investigated. Finally, both entropy coding 
schemes of H.264 are investigated with respect to the superposition mode. En- 
coder issues are detailed in Section 5.4, which covers rate-constrained mode 
decision, motion estimation, and superimposed motion estimation. In addition, 
the improvement of the overall rate-distortion performance with B-pictures is 
discussed.

5.2 Prediction Modes for B-Pictures

5.2.1 Overview

The macroblock modes for B-pictures allow intra and inter coding. The 
intra-mode macroblocks specified for inter pictures are also available for
B-pictures. The inter-mode macroblocks are especially tailored to B-picture use.
As for inter pictures, they utilize seven block size types as depicted in Fig. 5.1 
to generate the motion-compensated macroblock prediction signal. In addition,
the usage of the reference picture set available for predicting the current
B-picture is suited to its temporally non-causal nature.

Figure 5.1. Block size types for the motion-compensated macroblock prediction signal.

In contrast to the previously mentioned inter-mode macroblocks which sig- 
nal motion vector data according to its block size as side information, the 
direct-mode macroblock does not require such side information but derives 
reference frame, block size, and motion vector data from the subsequent inter 
picture. This mode linearly combines two prediction signals. One prediction 
signal is derived from the subsequent inter picture, the other from a previous 
picture. 

A linear combination of two motion-compensated prediction signals with
explicit side information is accomplished by the superposition mode. Existing
standards with B-pictures utilize the bidirectional mode, which only allows
the combination of a previous and subsequent prediction signal. The
superposition mode generalizes this concept and supports not only the already
mentioned forward/backward prediction pair, but also forward/forward and
backward/backward pairs.

5.2.2 Direct Mode

The direct mode uses bidirectional prediction and allows residual coding of 
the prediction error. The forward and backward motion vectors of this mode 
are derived from the motion vectors used in the corresponding macroblocks 
of the subsequent reference picture. The same number of motion vectors are 
used. To calculate prediction blocks, the forward and backward motion vectors 
are used to obtain appropriate blocks from reference pictures and then these 
blocks are linearly combined. Using multiple reference picture prediction, the 
forward reference picture for the direct mode is the same as the one used for 
the corresponding macroblock in the subsequent inter picture. The forward 
and backward motion vectors for direct mode macroblocks are calculated as 

$$ MV_F = \frac{TR_B}{TR_D}\, MV \tag{5.1} $$

$$ MV_B = \frac{TR_B - TR_D}{TR_D}\, MV, \tag{5.2} $$

where $MV_F$ is the forward motion vector, $MV_B$ is the backward motion
vector, and $MV$ represents the motion vectors in the corresponding macroblock
in the subsequent inter picture. $TR_D$ is the temporal distance between the
previous and the next inter picture, and $TR_B$ is the distance between the
current picture and the previous inter picture. It should be noted that when
multiple reference picture prediction is used, the reference picture for the
motion vector predictions is treated as though it were the most recent previous
decoded picture. Thus, instead of using the temporal reference of the exact
reference picture to compute the temporal distances $TR_D$ and $TR_B$, the
temporal reference of the most recent previous reference picture is used.
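
The scaling in (5.1) and (5.2) can be illustrated with a minimal Python sketch. The function name, the tuple representation of a motion vector, and the use of exact fractions are assumptions made for this illustration; rounding to the sub-pel grid of a real codec is deliberately left out.

```python
from fractions import Fraction

def direct_mode_vectors(mv, tr_b, tr_d):
    """Scale the co-located motion vector mv = (x, y) as in (5.1) and (5.2).

    tr_d: temporal distance between the previous and the next inter picture.
    tr_b: temporal distance between the current B-picture and the previous
          inter picture. Hypothetical helper, not taken from the test model.
    """
    if not 0 < tr_b < tr_d:
        raise ValueError("expected 0 < tr_b < tr_d")
    fwd_scale = Fraction(tr_b, tr_d)          # TR_B / TR_D
    bwd_scale = Fraction(tr_b - tr_d, tr_d)   # (TR_B - TR_D) / TR_D, negative
    mv_f = tuple(fwd_scale * c for c in mv)
    mv_b = tuple(bwd_scale * c for c in mv)
    return mv_f, mv_b

# Two B-pictures between inter pictures: TR_D = 3; the first B-picture has TR_B = 1.
mv_f, mv_b = direct_mode_vectors((6, -3), tr_b=1, tr_d=3)
print(mv_f, mv_b)
```

Note that the two scale factors differ by exactly one, so the forward vector minus the backward vector reproduces the original vector MV.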

The direct mode in H.264 is improved by weighted blending of the prediction
signal [218]. Video content like music videos and movie trailers makes
frequent use of fading transitions from scene to scene. It is very popular in
movie trailers to fade each scene to black, and then from black to the next
scene. Without weighted blending of the prediction signal, both normal fades
and “fades-to-black” are hard to encode well without visible compression
artifacts. For example, when encoding with a PBBB pattern, the B-pictures in
positions 1 and 3 suffer from quality degradation relative to the B-pictures in
position 2 and the surrounding inter and intra pictures. The direct mode
already scales the motion vector of the subsequent inter picture according to
the distance between the B-picture and the surrounding pictures. The weighted
blending technique additionally weights the calculation of the prediction
block based on this distance, instead of the averaging with equal weights that
has been used in all existing standards with B-pictures. The weighted blending
technique calculates the prediction block $c$ for direct mode coded
macroblocks according to



$$ c = \frac{c_p\,(TR_D - TR_B) + c_s\,TR_B}{TR_D}, \tag{5.3} $$



where $c_p$ is the prediction block from a previous reference picture, and
$c_s$ is the prediction block from the subsequent reference picture. Sequences
without any fades will not suffer from loss of compression efficiency relative
to the conventional way to calculate the prediction for direct mode.
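
To see why weighted blending helps with fades, the following Python sketch applies (5.3) to a toy fade-to-black and compares it with plain averaging. The function names and the sample values are invented for this illustration; blocks are flat lists of luminance samples.

```python
def blend_direct(c_p, c_s, tr_b, tr_d):
    """Weighted blending as in (5.3): weights follow the temporal distances."""
    return [(p * (tr_d - tr_b) + s * tr_b) / tr_d for p, s in zip(c_p, c_s)]

def average(c_p, c_s):
    """Conventional bidirectional averaging with equal weights."""
    return [(p + s) / 2 for p, s in zip(c_p, c_s)]

# Hypothetical linear fade to black with a PBBB pattern (TR_D = 4):
# the previous inter picture has luminance 120, the subsequent one 0.
c_p = [120, 120, 120, 120]
c_s = [0, 0, 0, 0]
for tr_b in (1, 2, 3):
    print(tr_b, blend_direct(c_p, c_s, tr_b, 4), average(c_p, c_s))
```

In this toy case the weighted prediction follows the fade at every B-picture position, whereas equal-weight averaging only matches the middle position, which is in line with the degradation of the first and third B-picture described above.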



5.2.3 Superposition Mode 

The superposition mode superimposes two macroblock prediction signals 
with their individual sets of motion vectors. We refer to each prediction sig- 
nal as a hypothesis. To calculate prediction blocks, the motion vectors of the 
two hypotheses are used to obtain appropriate blocks from reference pictures 
and then these blocks are averaged. Each hypothesis is specified by one of the 
seven block size types as depicted in Fig. 5.1. In addition, each hypothesis is 
also assigned one picture reference parameter. The motion vectors for each 
hypothesis are assigned on a block level and all of them refer to that specified 
reference picture. It is very likely that the hypotheses are chosen from different 
reference pictures but they can also originate from the same picture. Increas- 
ing the number of available reference pictures improves the performance of 
superimposed motion-compensated prediction as shown in Section 4.3.3 and 
theoretically discussed in Section 3.4.4. For B-pictures, more details are given 
in Section 5.3. 



5.2.4 Rate-Distortion Performance of Individual Modes 

The macroblock modes for B-pictures can be classified into four groups: 

1. No extra side information is transmitted for this particular macroblock. 
This corresponds to the direct mode. 

2. Side information for one macroblock prediction signal is transmitted. The
inter modes with block structures according to Fig. 5.1 and bidirectional
picture reference parameters belong to this group.

3. Side information for two macroblock prediction signals is transmitted to
allow superimposed prediction.

4. The last group includes all intra modes, i.e., no inter-frame prediction is
used.

In the following, the first three groups, which utilize inter-frame prediction,
are investigated with respect to their rate-distortion performance. The fourth
group with the intra modes is not considered further: these modes are available
in each experiment but are selected very rarely.




Figure 5.2. PSNR of the B-picture luminance signal vs. B-picture bit-rate for the QCIF se- 
quence Mobile & Calendar with 30 fps. Two B-pictures are inserted after each inter picture. 5 
past and 3 subsequent reference pictures are used. The compression efficiency of the B-picture 
coding modes direct, inter, and superposition are compared. 

The rate-distortion performance of the groups direct, inter, and superposition
is depicted in Fig. 5.2. The PSNR of the B-picture luminance signal
is plotted over the B-picture bit-rate for the QCIF sequence Mobile & Calendar.
With the direct mode for the B-pictures, the rate-distortion performance at
high bit-rates is dominated by the efficiency of the residual encoding. The inter
modes improve the compression efficiency by approximately 1 dB in PSNR at
moderate and high bit-rates. At very low bit-rates, the rate penalty for the
extra side information in effect disables the modes in the inter group. Similar
behavior can be observed for the superposition mode. Transmitting two
prediction signals increases the side information. Consequently, the
superposition mode improves compression efficiency by approximately 1 dB in
PSNR at high bit-rates.

Corresponding to the rate-distortion performance of the three groups,
Fig. 5.3 depicts the relative occurrence of the macroblock modes in B-pictures
vs. quantization parameter $QP_P$ for the QCIF sequence Mobile & Calendar.
At $QP_P = 28$ (low bit-rate), the direct mode is dominant with approximately
90% relative occurrence, whereas the superposition and inter modes are seldom
selected due to the rate constraint. At $QP_P = 12$ (high bit-rate), the
relative occurrence of the direct mode decreases to 30%, whereas the relative
frequency of the superposition mode increases to 60%. About 10% of the
macroblocks utilize an inter mode.

Figure 5.3. Relative occurrence of the macroblock modes in B-pictures vs.
quantization parameter for the QCIF sequence Mobile & Calendar with 30 fps.
Two B-pictures are inserted after each inter picture. 5 past and 3 subsequent
reference pictures are used. The relative frequency of the B-picture
macroblock modes direct, inter, and superposition are compared.

Figure 5.4. PSNR of the luminance signal vs. overall bit-rate for the QCIF
sequence Mobile & Calendar with 30 fps. Two B-pictures are inserted after each
inter picture. 5 past and 3 subsequent reference pictures are used. The
compression efficiency of the B-picture coding modes direct, inter, and
superposition are compared.

The influence of the B-picture coding modes direct, inter, and superposition 
on the overall compression efficiency is depicted in Fig. 5.4 for the QCIF se- 
quence Mobile & Calendar. The base layer (the sequence of inter pictures) 
is identical in all three cases and only the B-picture coding modes are se- 
lected from the specified classes. For this sequence, the inter modes in the 
B-pictures improve the overall efficiency approximately by 0.5 dB. The
superposition mode adds an additional 0.5 dB for higher bit-rates.

5.3 Superimposed Prediction 

5.3.1 Bidirectional vs. Superposition Mode 

In the following, we will outline the difference between the bidirectional 
macroblock mode, which is specified in the H.264 test model TML-9 [212], 
and the superposition mode proposed in [214] and discussed in the previous 
section. A bidirectional prediction type only allows a linear combination of a 
forward/backward prediction pair; see Fig. 5.5. 





Figure 5.5. A bidirectional prediction mode allows a linear combination of one past and one 
subsequent macroblock prediction signal. The inter pictures are denoted by P. 

The draft TML-9 utilizes multiple reference pictures for forward prediction
but allows only backward prediction from the most subsequent reference
picture. For bidirectional prediction, independently estimated forward and
backward prediction signals are practical but the efficiency can be improved by
joint estimation. For superimposed prediction in general, a joint estimation of
two hypotheses is necessary as discussed in Chapter 3. An independent estimate
might even deteriorate the performance. The test model software TML-9 does
not allow a joint estimation of forward and backward prediction signals.

The superposition mode removes the restriction of the bidirectional mode
to allow only linear combinations of forward and backward pairs [219, 220].
The additional combinations (forward, forward) and (backward, backward) are
obtained by extending a unidirectional picture reference syntax element to a
bidirectional picture reference syntax element; see Fig. 5.6. With this
bidirectional picture reference element, a generic prediction signal, which we
call hypothesis, can be formed with the syntax fields for reference frame,
block size, and motion vector data.







Figure 5.6. The superposition mode allows a linear combination of two past macroblock pre- 
diction signals (as depicted), two future macroblock signals, or one past and one future mac- 
roblock signal. The inter pictures are denoted by P. 

The superposition mode includes the bidirectional prediction mode when
the first hypothesis originates from a past reference picture and the second from
a future reference picture. The bidirectional mode limits the set of possible
reference picture pairs. Not surprisingly, a larger set of reference picture pairs
improves the coding efficiency of B-pictures.
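
How much larger the set of reference picture pairs becomes can be counted with a short Python sketch. It assumes the configuration used elsewhere in this chapter (5 past and 3 subsequent reference pictures), treats pairs as unordered, and allows both hypotheses to come from the same picture; the relative picture indices are placeholders chosen for this illustration.

```python
from itertools import combinations_with_replacement

past = [-5, -4, -3, -2, -1]   # 5 past reference pictures, relative to the B-picture
future = [1, 2, 3]            # 3 subsequent reference pictures
refs = past + future

def pair_type(a, b):
    """Classify an unordered reference picture pair with a <= b."""
    if b < 0:
        return "forward/forward"
    if a > 0:
        return "backward/backward"
    return "forward/backward"

counts = {}
for a, b in combinations_with_replacement(refs, 2):
    kind = pair_type(a, b)
    counts[kind] = counts.get(kind, 0) + 1

# A bidirectional mode only admits one past and one subsequent picture per pair.
bidirectional_pairs = len(past) * len(future)
print(counts, bidirectional_pairs)
```

Under these assumptions the forward/backward pairs available to a bidirectional mode are a strict subset of the pairs available to the superposition mode.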

The following results are based on the H.264 test model TML-9 [212]. For 
our experiments, the CIF sequences Mobile & Calendar and Flowergarden are 
coded at 30 fps. We investigate the rate-distortion performance of the super- 
position mode in comparison with the bidirectional mode when two B-pictures 
are inserted. 

Figs. 5.7 and 5.8 depict the average luminance PSNR from reconstructed
B-pictures over the overall bit-rate produced by B-pictures with bidirectional
prediction mode and the superposition mode for the sequences Mobile &
Calendar and Flowergarden. Either 1 or 3 future reference pictures are used,
with a constant number of 5 past pictures. It can be observed that increasing
the total number of reference pictures from 5 + 1 to 5 + 3 slightly improves
compression efficiency. Moreover, the superposition mode outperforms the
bidirectional mode and its compression efficiency improves for increasing
bit-rate. In the case of the bidirectional mode, jointly estimated forward and
backward prediction signals outperform independently estimated signal pairs.

Figure 5.7. PSNR of the B-picture luminance signal vs. B-picture bit-rate for
the CIF sequence Mobile & Calendar with 30 fps. Two B-pictures are inserted
after each inter picture. $QP_B = QP_P$. The superposition mode is compared to
the bidirectional mode.

Figure 5.8. PSNR of the B-picture luminance signal vs. B-picture bit-rate for
the CIF sequence Flowergarden with 30 fps. Two B-pictures are inserted after
each inter picture. $QP_B = QP_P$. The superposition mode is compared to the
bidirectional mode.

5.3.2 Two Combined Forward Prediction Signals 

Generalized B-pictures combine both the superposition of prediction sig- 
nals and the reference picture selection from past and future pictures. In the 
following, we investigate generalized B-pictures with forward-only predic- 
tion and utilize them like inter pictures in Chapter 4 for comparison purposes 
[213, 221]. That is, only a unidirectional reference picture parameter which 
addresses past pictures is permitted. As there is no future reference picture, 
the direct mode is replaced by the skip mode as specified for inter pictures. 
The generalized B-pictures with forward-only prediction cause no extra cod- 
ing delay as they utilize only past pictures for prediction and are also used for 
reference to predict future pictures. 







Figure 5.9. Generalized B-pictures with forward-only prediction utilize multiple reference
picture prediction and superimposed motion-compensated prediction. The superposition mode 
uses two hypotheses chosen from past reference pictures. 

Fig. 5.9 shows generalized B-pictures with forward-only prediction. They
allow multiple reference picture prediction and linearly combined
motion-compensated prediction signals with individual block size types. Both
hypotheses are simply averaged to form the prediction for the current
macroblock. As depicted in Fig. 5.1, the H.264 test model [212] allows seven
different block sizes, which will be the seven hypothesis types in the
superposition mode. The draft H.264 standard allows only one picture reference
parameter per macroblock for inter modes and assumes that all sub-blocks can
be found on that specified reference picture. This is different from the H.263
standard, where multiple reference picture prediction utilizes picture
reference parameters for both macroblocks and 8 x 8 blocks [91].




Figure 5.10. Average bit-rate at 35 dB PSNR vs. number of reference pictures for the CIF
sequence Mobile & Calendar with 30 fps. Generalized B-pictures with forward-only prediction 
are compared to inter pictures. 

We investigate the rate-distortion performance of generalized B-pictures 
with forward-only prediction and compare them to H.264 inter pictures for var- 
ious numbers of reference pictures. Figs. 5.10 and 5.11 show the bit-rate values 
at 35 dB PSNR of the luminance signal over the number of reference pictures 
M for the CIF sequences Mobile & Calendar and Tempete, respectively, coded 
at 30 fps. We compute PSNR vs. bit-rate curves by varying the quantization 
parameter and interpolate intermediate points by a cubic spline. The perfor- 
mance of H.264 inter pictures (IPPP...) and the generalized B-pictures with 
forward-only prediction (IBBB...) is shown. 
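
Reading off a bit-rate at a fixed PSNR from a handful of measured operating points can be sketched as follows. This is only an illustration: the rate and PSNR values are invented, the spline is a natural cubic spline written out in plain Python, and the bit-rate is interpolated as a function of PSNR, which is one possible reading of the procedure described above.

```python
def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys)."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    m = [0.0] * (n + 1)  # second derivatives; natural ends keep m[0] = m[n] = 0
    if n >= 2:
        diag = [2.0 * (h[k] + h[k + 1]) for k in range(n - 1)]
        rhs = [6.0 * ((ys[k + 2] - ys[k + 1]) / h[k + 1]
                      - (ys[k + 1] - ys[k]) / h[k]) for k in range(n - 1)]
        for k in range(1, n - 1):          # forward elimination (tridiagonal)
            w = h[k] / diag[k - 1]
            diag[k] -= w * h[k]
            rhs[k] -= w * rhs[k - 1]
        for k in range(n - 2, -1, -1):     # back substitution
            m[k + 1] = (rhs[k] - h[k + 1] * m[k + 2]) / diag[k]

    def evaluate(x):
        # Find the segment containing x; values outside the data range are
        # extrapolated from the first or last segment.
        i = 0
        while i < n - 1 and x > xs[i + 1]:
            i += 1
        a, b = xs[i + 1] - x, x - xs[i]
        return ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * b)

    return evaluate

# Invented operating points (one per quantization parameter), not measured data.
psnr_db = [31.0, 33.5, 36.0, 38.5]
rate_kbps = [600.0, 1100.0, 1900.0, 3100.0]
rate_at = natural_cubic_spline(psnr_db, rate_kbps)
print(round(rate_at(35.0), 1))
```

Repeating this for each number of reference pictures M would give one bit-rate value per M, which is the kind of curve plotted in Figs. 5.10 and 5.11.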

Figure 5.11. Average bit-rate at 35 dB PSNR vs. number of reference pictures
for the CIF sequence Tempete with 30 fps. Generalized B-pictures with
forward-only prediction are compared to inter pictures.

The generalized B-pictures with forward-only prediction and $M = 1$ reference
picture have to choose both hypotheses from the previous picture. For
$M > 1$, we allow more than one reference picture for each hypothesis. The
reference pictures for both hypotheses are selected by the rate-constrained
superimposed motion estimation algorithm described in Section 5.4.3. The
picture reference parameter also allows the special case that both hypotheses
are chosen from the same reference picture. The rate constraint is responsible
for the trade-off between prediction quality and bit-rate. Using the
generalized B-pictures with forward-only prediction and $M = 10$ reference
pictures reduces the bit-rate from 2019 to 1750 kbit/s when coding the
sequence Mobile & Calendar. This corresponds to 13% bit-rate savings. The gain
by the generalized B-pictures with forward-only prediction and just one
reference picture is limited to 6%. The gain by the generalized B-pictures
over the inter pictures improves for an increasing number of reference
pictures as already discussed in Section 4.3.3 for H.263. This observation is
independent of the implemented superimposed prediction scheme and is supported
by the theoretical investigation in Section 3.4.4.
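
For reference, the quoted saving for ten reference pictures follows directly from the two bit-rates given above:

$$ \frac{2019 - 1750}{2019} = \frac{269}{2019} \approx 0.133, $$

that is, roughly 13% of the bit-rate of the inter pictures.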

Figs. 5.12 and 5.13 depict the average luminance PSNR from reconstructed 
pictures over the overall bit-rate produced by H.264 inter pictures (IPPP...) 
and the generalized B-pictures with forward prediction only (IBBB...) for the 
sequences Mobile & Calendar and Tempete, respectively. The number of ref- 
erence pictures is chosen to be M = 1 and 5. It can be observed that the gain 
by generalized B-pictures improves for increasing bit-rate.

5.3.3 Entropy Coding 

Entropy coding for H.264 B-pictures can be carried out in one of two
different ways: universal variable length coding (UVLC) or context-based
adaptive binary arithmetic coding (CABAC) [222-224]. The UVLC scheme uses
only one variable length code to map all syntax elements to binary
representations, whereas CABAC utilizes context modeling and adaptive
arithmetic codes to exploit conditional probabilities and non-stationary
symbol statistics [223]. The simplicity of the UVLC scheme is striking as it
demonstrates good compression efficiency at very low computational costs.
CABAC with higher computational complexity provides additional bit-rate
savings mainly for low and high bit-rates.

Figure 5.12. PSNR of the luminance signal vs. overall bit-rate for the CIF
sequence Mobile & Calendar with 30 fps. Generalized B-pictures with
forward-only prediction are compared to inter pictures.

Figure 5.13. PSNR of the luminance signal vs. overall bit-rate for the CIF
sequence Tempete with 30 fps. Generalized B-pictures with forward-only
prediction are compared to inter pictures.




Figure 5.14. PSNR of the B-picture luminance signal vs. B-picture bit-rate for
the CIF sequence Mobile & Calendar with 30 fps. Two B-pictures are inserted
after each inter picture. 5 past and 3 future inter pictures are used for
predicting each B-picture. $QP_B = QP_P + 2$ and $\lambda_B = 4w(QP_P)$. The
superposition mode and the bidirectional mode with independent estimation are
compared for both entropy coding schemes.

Figs. 5.14 and 5.15 depict the B-picture compression efficiency for the CIF
sequences Mobile & Calendar and Flowergarden, respectively. For
motion-compensated prediction, 5 past and 3 future inter pictures are used in
all cases. The superposition mode and the bidirectional mode with independent
estimation of prediction signals are compared for both entropy coding schemes.
The PSNR gains by the superposition mode and the CABAC scheme are comparable
for the investigated sequences at high bit-rates. When enabling the
superposition mode together with CABAC, additive gains can be observed. The
superposition mode improves the efficiency of motion-compensated prediction
and CABAC optimizes the entropy coding of the utilized syntax elements.

Figure 5.15. PSNR of the B-picture luminance signal vs. B-picture bit-rate for
the CIF sequence Flowergarden with 30 fps. Two B-pictures are inserted after
each inter picture. 5 past and 3 future inter pictures are used for predicting
each B-picture. $QP_B = QP_P + 2$ and $\lambda_B = 4w(QP_P)$. The
superposition mode and the bidirectional mode with independent estimation are
compared for both entropy coding schemes.

The syntax elements used by the superposition mode can be coded with both
the UVLC and the CABAC scheme. When using CABAC for the superposition
mode, the context model for the syntax element motion vector data is adapted
to superimposed motion. The context model for the motion vector of the first
hypothesis captures the motion activity of the spatial neighbors, i.e., the
left and top neighboring blocks. The context model for the motion vector of
the second hypothesis captures the motion activity of the first hypothesis.
The context models for the remaining syntax elements are not altered. The
experimental results show that generalizing the bidirectional mode to the
superposition mode improves B-picture compression efficiency for both schemes.

5.4 Encoder Issues 

5.4.1 Rate-Constrained Mode Decision 

The test model TML-9 distinguishes between a low- and high-complexity 
encoder. For a low-complexity encoder, computationally inexpensive rules for 
mode decision are recommended [225]. For a high-complexity encoder, the 
macroblock mode decision is ruled by minimizing the Lagrangian function 

$$ J_1(\mathrm{Mode} \mid QP, \lambda) = \mathrm{SSD}(\mathrm{Mode} \mid QP) + \lambda\, R(\mathrm{Mode} \mid QP), \tag{5.4} $$

where $QP$ is the macroblock quantizer parameter, and $\lambda$ the Lagrange
multiplier for mode decision. Mode indicates the selection from the set of
potential coding modes. SSD is the sum of the squared differences between the
original block and its reconstruction. It also takes into account the
distortion in the chrominance components. $R$ is the number of bits associated
with choosing Mode and $QP$, including the bits for macroblock header, motion
information, and all integer transform blocks. The Lagrangian multiplier
$\lambda$ is related to the macroblock quantizer parameter $QP$ by



$$ \lambda := w(QP) = 5\, \frac{QP + 5}{34 - QP}\, \exp\!\left(\frac{QP}{10}\right). \tag{5.5} $$



Detailed discussions of this relationship can be found in [121] and [123]. Ex- 
perimental results in Section 5.4.4 confirm that this relation should be adapted 
for B-pictures as specified in the test model TML-9, 



$$ \lambda_B = 4\, w(QP_B), \tag{5.6} $$



such that the overall rate-distortion efficiency for the sequence is improved. 

Mode decision selects the best mode among all B-picture macroblock
modes. In order to reduce encoder complexity, mode decision assumes
pre-computed motion vectors for all inter modes, which are determined
independently by rate-constrained motion estimation.
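
The mode decision rule can be sketched in a few lines of Python. The distortion and rate figures below are invented, and the function names are hypothetical. The form of w(QP) used here, including the exponential factor, follows the test model definition as we understand it and should be checked against [121] and [123] before it is relied upon.

```python
import math

def w(qp):
    """Lagrange multiplier as a function of QP, cf. (5.5); requires QP < 34.

    Assumed form: 5 * exp(QP / 10) * (QP + 5) / (34 - QP).
    """
    return 5.0 * math.exp(qp / 10.0) * (qp + 5.0) / (34.0 - qp)

def lambda_b(qp_b):
    """B-picture multiplier, cf. (5.6)."""
    return 4.0 * w(qp_b)

def choose_mode(candidates, lam):
    """Pick the mode minimising J_1 = SSD + lam * R, cf. (5.4).

    candidates maps a mode name to a pair (SSD, rate in bits).
    """
    return min(candidates,
               key=lambda mode: candidates[mode][0] + lam * candidates[mode][1])

# Invented (SSD, rate) pairs for one macroblock; in a real encoder both
# values would themselves depend on the quantization parameter.
candidates = {
    "direct": (9000.0, 2),
    "inter": (6000.0, 40),
    "superposition": (3500.0, 75),
}
for qp in (12, 28):
    print(qp, round(lambda_b(qp), 1), choose_mode(candidates, lambda_b(qp)))
```

With these toy numbers the large multiplier at a coarse quantizer favours the mode with almost no side information, while the small multiplier at a fine quantizer lets the more expensive mode win, which mirrors the trend reported for Fig. 5.3.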



5.4.2 Rate-Constrained Motion Estimation

Motion estimation is also performed in a rate-constrained framework. The 
encoder minimizes the Lagrangian cost function 

$$ J_2(m, r \mid \lambda_{SAD}, p) = \mathrm{SAD}(m, r) + \lambda_{SAD}\, R(m - p, r), \tag{5.7} $$

with the motion vector $m$, the predicted motion vector $p$, the reference frame
parameter $r$, and the Lagrange multiplier $\lambda_{SAD}$ for the SAD
distortion measure. The rate term $R$ represents the motion information and
the number of bits associated with choosing the reference picture $r$. The
rate is estimated by table-lookup using the universal variable length code
(UVLC) table, even if the arithmetic entropy coding method is used. For
integer-pixel search, SAD is the summed absolute difference between the
original luminance signal and the motion-compensated luminance signal. In the
sub-pixel refinement search, the Hadamard transform of the difference between
the original luminance signal and the motion-compensated luminance signal is
calculated and SAD is the sum of the absolute transform coefficients. The
Hadamard transform in the sub-pixel search reflects the performance of the
integer transform on the residual signal such that the expected reconstruction
quality rather than the motion-compensated prediction quality is taken into
account for the refinement. This favors sub-pixel positions with residuals
that are highly correlated for a given summed distortion. The Lagrangian
multiplier $\lambda_{SAD}$ for the SAD distortion measure is related to the
Lagrangian multiplier for the SSD measure (5.5) by

$$ \lambda_{SAD} = \sqrt{\lambda}. \tag{5.8} $$
Further details as well as block size issues for motion estimation are discussed
in [121] and [123].
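
A minimal integer-pel version of the search that minimizes (5.7) might look as follows in Python. Everything here is a sketch under stated assumptions: frames are small lists of rows, the rate is approximated by the length of an Exp-Golomb-style codeword as a stand-in for the UVLC table lookup, sub-pel refinement with the Hadamard transform is omitted, and all function names are hypothetical.

```python
def uvlc_bits(code_number):
    """Codeword length 1, 3, 3, 5, 5, 5, 5, 7, ... for code numbers 0, 1, 2, ..."""
    return 2 * (code_number + 1).bit_length() - 1

def signed_code_number(v):
    """Map 0, 1, -1, 2, -2, ... to code numbers 0, 1, 2, 3, 4, ..."""
    return 2 * v - 1 if v > 0 else -2 * v

def motion_rate(m, p, r):
    """Approximate bits for the vector difference m - p and reference index r."""
    return (uvlc_bits(signed_code_number(m[0] - p[0]))
            + uvlc_bits(signed_code_number(m[1] - p[1]))
            + uvlc_bits(r))

def sad(block, ref, x, y):
    """Summed absolute difference between block and the ref area at (x, y)."""
    return sum(abs(block[j][i] - ref[y + j][x + i])
               for j in range(len(block)) for i in range(len(block[0])))

def motion_search(block, refs, pos, p, lam_sad, search=2):
    """Full search over displacements and reference pictures; returns
    (cost, (dx, dy), r) minimising SAD + lam_sad * rate."""
    best = None
    rows, cols = len(block), len(block[0])
    for r, ref in enumerate(refs):
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                x, y = pos[0] + dx, pos[1] + dy
                if not (0 <= x <= len(ref[0]) - cols and 0 <= y <= len(ref) - rows):
                    continue  # candidate would leave the picture
                cost = sad(block, ref, x, y) + lam_sad * motion_rate((dx, dy), p, r)
                if best is None or cost < best[0]:
                    best = (cost, (dx, dy), r)
    return best

def frame_with_patch(size, patch, x, y):
    """Black square frame with patch copied to position (x, y)."""
    frame = [[0] * size for _ in range(size)]
    for j, row in enumerate(patch):
        for i, v in enumerate(row):
            frame[y + j][x + i] = v
    return frame

# Toy data: reference 0 holds a slightly distorted copy at zero displacement,
# reference 1 holds an exact copy at displacement (2, 1).
block = [[50, 60], [70, 80]]
refs = [frame_with_patch(6, [[52, 61], [69, 82]], 2, 2),
        frame_with_patch(6, block, 4, 3)]
low = motion_search(block, refs, pos=(2, 2), p=(0, 0), lam_sad=0.5)
high = motion_search(block, refs, pos=(2, 2), p=(0, 0), lam_sad=2)
print(low, high)
```

In this toy setup a small multiplier accepts the more expensive vector and reference index to obtain the exact match, whereas a larger multiplier settles for the cheaper, slightly worse candidate.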

5.4.3 Rate-Constrained Superimposed Motion Estimation 

For the superposition mode, the encoder utilizes rate-constrained
superimposed motion estimation. The cost function incorporates the
superimposed prediction error of the video signal as well as the bit-rate for
two picture reference parameters, two hypotheses types, and the associated
motion vectors. Rate-constrained superimposed motion estimation is performed
by the hypothesis selection algorithm in Fig. 2.6. This iterative algorithm
performs conditional rate-constrained motion estimation and is a
computationally feasible solution to the joint estimation problem [17] which
has to be solved for finding an efficient pair of hypotheses.

The iterative algorithm is initialized with the data of the best macroblock 
type for multiple reference prediction (initial hypothesis). For two hypotheses, 
the algorithm continues with: 

1. One hypothesis is fixed and conditional rate-constrained motion estima- 
tion is applied to the complementary hypothesis such that the superposition 
costs are minimized. 

2. The complementary hypothesis is fixed and the first hypothesis is opti- 
mized. 

The two steps (= one iteration) are repeated until convergence. For the cur- 
rent hypothesis, conditional rate-constrained motion estimation determines the 
conditional optimal picture reference parameter, hypothesis type, and associ- 
ated motion vectors. For the conditional optimal motion vectors, an integer-pel 
accurate estimate is refined to sub-pel accuracy. 

Fig. 5.16 shows the average number of iterations for superimposed motion 
estimation with 5 reference pictures over the quantization parameter. On aver- 
age, it takes about 2 iterations to achieve a Lagrangian cost smaller than 0.5% 
relative to the cost in the previous iteration. The algorithm converges faster for 
higher quantization parameter values. 
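
The alternating structure of the algorithm can be sketched in Python as follows. This is a simplified illustration, not the procedure of Fig. 2.6 itself: each candidate hypothesis is represented as an already motion-compensated block together with its side-information rate, the distortion is a plain SAD, and the stopping rule (relative cost decrease of at most 0.5%) is one reading of the convergence criterion mentioned above. All names and numbers are invented.

```python
def sad(a, b):
    """Summed absolute difference between two sample lists."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pair_cost(orig, h1, h2, lam):
    """Lagrangian cost of the averaged pair; a hypothesis is (block, rate_bits)."""
    pred = [(x + y) / 2 for x, y in zip(h1[0], h2[0])]
    return sad(orig, pred) + lam * (h1[1] + h2[1])

def estimate_pair(orig, candidates, lam, tol=0.005, max_iter=10):
    """Alternately re-optimise one hypothesis while the other is held fixed."""
    # Initial hypothesis: best single candidate in the rate-constrained sense.
    first = min(candidates, key=lambda c: sad(orig, c[0]) + lam * c[1])
    second = first
    prev = pair_cost(orig, first, second, lam)
    for iteration in range(1, max_iter + 1):
        # Step 1: fix the first hypothesis, optimise the complementary one.
        second = min(candidates, key=lambda c: pair_cost(orig, first, c, lam))
        # Step 2: fix the complementary hypothesis, optimise the first one.
        first = min(candidates, key=lambda c: pair_cost(orig, c, second, lam))
        cur = pair_cost(orig, first, second, lam)
        if prev - cur <= tol * prev:   # cost no longer decreases noticeably
            break
        prev = cur
    return first, second, cur, iteration

# Toy candidates: (motion-compensated block, rate in bits).
orig = [10, 20, 30, 40]
candidates = [
    ([8, 18, 28, 38], 2),
    ([12, 22, 32, 43], 2),
    ([10, 20, 30, 50], 3),
]
first, second, cost, iterations = estimate_pair(orig, candidates, lam=1.0)
print(first, second, cost, iterations)
```

Because the currently selected candidate is always part of the search set in each step, the cost in this sketch cannot increase from one step to the next, so the loop terminates.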

Given the best single hypothesis for motion-compensated prediction (best 
inter mode) and the best hypothesis pair for superimposed prediction, the re- 
sulting prediction errors are transform coded to compute the Lagrangian costs 
for the mode decision. 

Figure 5.16. Average number of iterations for superimposed motion estimation vs. quantization
parameter for the CIF sequence Mobile & Calendar with 30 fps and M = 5 reference
pictures.

Superimposed motion-compensated prediction improves the prediction sig- 
nal by allocating more bits to the side information associated with the motion- 
compensating predictor. But the encoding of the prediction error and its as- 
sociated bit-rate also determines the quality of the reconstructed macroblock. 
A joint optimization of superimposed motion estimation and prediction error 
coding is far too demanding. But superimposed motion estimation indepen- 
dent of prediction error encoding is an efficient and practical solution if rate- 
constrained superimposed motion estimation is applied. 

It turns out that the superposition mode is not necessarily the best one for 
each macroblock. Therefore, the rate-distortion optimal mode selection is a 
very important tool to decide whether a macroblock should be predicted with 
one or two hypotheses. 
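
A minimal sketch of such a mode decision, assuming the usual Lagrangian cost J = D + λR. The mode names and the distortion and rate figures are invented for illustration only; they are not measurements from the experiments.

```python
# Sketch of a rate-distortion optimal mode decision with J = D + lam * R.
# The numbers below are hypothetical, chosen only to show the trade-off.

def lagrangian_cost(distortion, rate, lam):
    """Lagrangian cost of one coding mode."""
    return distortion + lam * rate


def choose_mode(modes, lam):
    """Return the name of the mode with the smallest Lagrangian cost.

    `modes` maps a mode name to a (distortion, rate) pair.
    """
    return min(modes, key=lambda name: lagrangian_cost(*modes[name], lam))


if __name__ == "__main__":
    modes = {
        "single_hypothesis": (120.0, 30),  # cheaper side information
        "superposition": (80.0, 55),       # better prediction, more bits
    }
    for lam in (1.0, 4.0):
        print(lam, choose_mode(modes, lam))
```

With these assumed numbers, a small Lagrange parameter favours the superposition mode, while a larger one penalizes its extra side information and favours the single hypothesis.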

Fig. 5.17 shows the relative occurrence of the superposition mode in gen- 
eralized B-pictures over the quantization parameter for the CIF sequence Mo- 
bile & Calendar. 5 past and 3 future reference pictures are used. Results for 
both entropy coding schemes are plotted. For high bit-rates (small quantization
parameters), the superposition mode exceeds a relative occurrence of 50%
among all B-picture coding modes. For low bit-rates (large quantization
parameters), the superposition mode is selected infrequently and, consequently,
the improvement in coding efficiency is very small. In addition, the relative
occurrence is slightly larger for the CABAC entropy coding scheme since the
more efficient CABAC scheme somewhat relieves the rate constraint imposed
on the side information.

Figure 5.17. Relative occurrence of the superposition mode in B-pictures vs. quantization
parameter for the CIF sequence Mobile & Calendar with 30 fps. 5 past and 3 future reference
pictures are used. $QP_B = QP_P + 2$.

5.4.4 Improving Overall Rate-Distortion Performance 

When B-pictures are considered to be an enhancement layer in a scalable 
representation, they are predicted from reference pictures that are provided by 
the base layer. Consequently, the quality of the base layer influences the rate- 
distortion trade-off for B-pictures in the enhancement layer. Experimental re- 
sults show that the relationship between quantization and Lagrange parameter 
for mode decision $\lambda = w(QP)$ should be adapted [226]. The following
experimental results are obtained with the test model software TML-9, i.e., with
bidirectional prediction and independent estimation of forward and backward 
prediction parameters. 

Fig. 5.18 shows the PSNR of the luminance signal vs. overall bit-rate for
the QCIF sequence Mobile & Calendar with 30 fps. Three different $\lambda$ - $QP$
dependencies are depicted. The worst compression efficiency is obtained with
$\lambda_B = w(QP_B)$. The cases $\lambda_B = 4w(QP_B)$ and $\lambda_B = 8w(QP_B)$ demonstrate
superior efficiency for low bit-rates. The scaling of the dependency alters the
bit-rate penalty for all B-picture coding modes such that the overall compression
efficiency is improved. The factor 4 is suggested in the current test model
description.

Figure 5.18. PSNR of the luminance signal vs. overall bit-rate for the QCIF sequence Mobile
& Calendar with 30 fps. Two B-pictures are inserted and the influence of the $\lambda_B$ - $QP_B$
relationship on the overall compression efficiency is investigated.




Figure 5.19. PSNR of the luminance signal vs. overall bit-rate for the QCIF sequence Mobile
& Calendar with 30 fps. Two B-pictures are inserted and the influence of the B-picture quantization
parameter $QP_B$ on the overall compression efficiency is investigated for $\lambda_B = 4w(QP_B)$.

Further experiments show that not only the relationship between quantization
and Lagrange parameter for mode decision has to be adapted for B-pictures
but also the PSNR of the enhancement layer should be lowered in
comparison to the base layer to improve overall compression efficiency [226].
Fig. 5.19 also depicts the PSNR of the luminance signal vs. overall bit-rate
for the QCIF sequence Mobile & Calendar with 30 fps. The plot compares 
the compression efficiency of various layered bit-streams with two inserted 
B-pictures. The quantization parameters of inter and B-pictures differ by a 
constant offset. For comparison, the efficiency of the single layer bit-stream 
is provided. Increasing the quantization parameter for B-pictures, that is, low- 
ering their relative PSNR, improves the overall compression efficiency of the 
sequence. 




Figure 5.20. PSNR of the luminance signal for individual pictures. Two B-pictures are
inserted. The B-picture quantization parameter $QP_B$ is incremented by 2 and the B-picture
Lagrange parameter is $\lambda_B = 4w(QP_B)$. $QP_P = 14$.



Fig. 5.20 shows the PSNR of the luminance signal for individual pictures of
the sequence Mobile & Calendar encoded with $QP_P = 14$. The PSNR of the
B-pictures encoded with an increment of 2 is significantly lower compared to
the case with identical quantization parameter in both layers. The compression
efficiency of the sequence increases by lowering the relative PSNR of the
enhancement layer. For the investigated sequence, the average PSNR
increases by almost 1 dB (see Fig. 5.19), whereas the PSNR of individual
B-pictures drops by more than 1 dB. In this case, higher average PSNR with
temporal fluctuations is compared to lower average PSNR with fewer fluctuations
for a given bit-rate.




Figure 5.21. PSNR of the luminance signal vs. overall bit-rate for the QCIF sequence Foreman
with 30 fps. When replacing two inter pictures by B-pictures, the values $QP_B$ and $\lambda_B$ have to
be adapted for best compression efficiency. Keeping the inter picture values may lower the
efficiency.

Fig. 5.21 shows the PSNR of the luminance signal vs. overall bit-rate for 
the QCIF sequence Foreman with 30 fps. The depicted results demonstrate 
that without adapting the quantization parameter and the $\lambda$ - $QP$ dependency
for B-pictures, we observe a degradation in compression efficiency if two inter 
pictures are replaced by B-pictures, whereas the adaptation improves the PSNR 
by about 0.5 dB for a given bit-rate. 

5.5 Conclusions 

This chapter discusses B-pictures in the context of the draft H.264 video 
compression standard. We focus on reference picture selection and linearly 
combined motion-compensated prediction signals. We show that bidirectional 
prediction only partially exploits the efficiency of combined prediction signals 
whereas superimposed prediction allows a more general form of B-pictures. 
The general concept of linearly combined prediction signals chosen from an 
arbitrary set of reference pictures further improves the H.264 test model TML- 
9 which is used in this chapter. 

We outline H.264 macroblock prediction modes for B-pictures, classify 
them into four groups and compare their efficiency in terms of rate-distortion 
performance. When investigating superimposed prediction, we show that
bidirectional prediction is a special case of this concept. Superimposed prediction
also allows two combined forward prediction signals. Experimental results
show that this case is also advantageous in terms of compression efficiency. 
The draft H.264 video compression standard offers improved entropy coding 
by context-based adaptive binary arithmetic coding. Simulations show that 
the gains by superimposed prediction and arithmetic coding are additive. B- 
pictures establish an enhancement layer and are predicted from reference pic- 
tures that are provided by the base layer. The quality of the base layer in- 
fluences the rate-distortion trade-off for B-pictures. We demonstrate how the 
quality of the B-pictures should be reduced in order to improve the overall 
rate-distortion performance of the scalable representation. 

Conceptually, we differentiate between picture reference selection and lin- 
early combined prediction signals. This distinction is reflected in the term 
Generalized B-Pictures. The feature of reference picture selection has been im- 
proved significantly when compared to existing video compression standards. 
But the draft H.264 video compression standard has been extended recently to 
support the features of linearly combined prediction signals that are described 
in this chapter. 

Towards a generalized picture type, a desirable definition would make a dis- 
tinction whether only past reference pictures or also future reference pictures 
are used for prediction. If both past and future reference pictures are available, 
this generalized picture type would utilize the direct and superposition mode, 
whereas if only past reference frames are available, this generalized picture 
type would replace the direct mode by the copy mode. 




Chapter 6 



MOTION COMPENSATION 
FOR GROUPS OF PICTURES 



6.1 Introduction 

So far, we discussed video coding schemes that utilize inter-frame methods 
with motion-compensated prediction (MCP) for efficient compression. Such 
compression schemes require sequential processing of video signals which 
makes it difficult to achieve efficient embedded representations of video
sequences. This chapter investigates the efficiency of motion-compensated 3D
transform coding, a compression scheme that employs a motion-compensated
transform for groups of pictures. We investigate this coding scheme
experimentally and theoretically. The practical coding scheme employs in temporal
direction a motion-compensated subband decomposition for each group of
pictures. We also compare the experimental results to those of a predictive video
codec with motion compensation and comparable complexity. The theoretical
investigation models this motion-compensated subband coding scheme for a
group of K pictures with a signal model for K motion-compensated pictures
that are decorrelated by a linear transform. We utilize the Karhunen-Loeve
Transform to obtain theoretical performance bounds at high bit-rates and
compare to both optimum intra-frame coding of individual motion-compensated
pictures and motion-compensated predictive coding.

The outline of this chapter is as follows: Section 6.2 discusses the motion- 
compensated subband coding scheme with a motion-compensated lifted Haar 
wavelet and a motion-compensated lifted 5/3 wavelet. Experimental results 
and comparisons for several test sequences are presented. The theoretical 
signal model is developed in Section 6.3. The motion-compensated wavelet 
kernels of the practical coding schemes are generalized to obtain theoretical 
performance bounds. 

6.2 Coding Scheme

Applying a linear transform in temporal direction of a video sequence is not
very efficient if significant motion is prevalent. However, a combination of a
linear transform and motion compensation seems promising for efficient
compression. For wavelet transforms, the so-called Lifting Scheme [156] can be
used to construct the kernels. A two-channel decomposition can be achieved
with a sequence of prediction and update steps that form a ladder structure. 
The advantage is that this lifting structure is able to map integers to integers 
without requiring invertible lifting steps. Further, motion compensation can be 
incorporated into the prediction and update steps as proposed in [20]. The fact 
that the lifting structure is invertible without requiring invertible lifting steps 
makes this approach feasible. We cannot count on motion compensation to be 
invertible in general. If it is invertible, this motion-compensated wavelet trans- 
form based on lifting permits a linear transform along the motion trajectories 
in a video sequence. 
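
The structural invertibility of lifting can be illustrated with a small sketch. As an assumption for illustration, one-sample shifts with edge replication stand in for motion compensation (they discard a sample and are therefore not invertible), frames are one-dimensional integer lists, and the function names are hypothetical. The update step uses floor division, so integers map to integers.

```python
# Sketch: a lifting ladder is invertible even when its lifting steps are not.
# The edge-replicating shifts below are stand-ins for motion compensation.

def shift_right(frame):
    """Displace by one sample with edge replication; the last sample is lost."""
    return [frame[0]] + frame[:-1]


def shift_left(frame):
    """Displace by one sample in the opposite direction; the first sample is lost."""
    return frame[1:] + [frame[-1]]


def forward(even, odd):
    """Prediction step followed by update step."""
    high = [o - p for o, p in zip(odd, shift_right(even))]
    low = [e + u // 2 for e, u in zip(even, shift_left(high))]
    return low, high


def inverse(low, high):
    """Undo the ladder by subtracting the same lifting terms in reverse order."""
    even = [l - u // 2 for l, u in zip(low, shift_left(high))]
    odd = [h + p for h, p in zip(high, shift_right(even))]
    return even, odd


if __name__ == "__main__":
    even, odd = [10, 12, 15, 20], [11, 13, 14, 22]
    low, high = forward(even, odd)
    print(low, high, inverse(low, high) == (even, odd))
```

The inverse never has to undo a shift; it only recomputes the same lifting term from data that is already available and subtracts it, which is why the steps themselves need not be invertible.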

In the following, we investigate coding schemes that process video se- 
quences in groups of K pictures (GOP). First, we decompose each GOP in 
temporal direction. The dyadic decomposition utilizes a motion-compensated 
wavelet which will be discussed later in more detail. The temporal transform 
provides K output pictures that are intra-frame encoded. In order to allow a 
comparison to a basic predictive coder with motion compensation, we utilize 
for the intra-frame coder an 8 x 8 DCT with run-length coding. If we employ a
Haar wavelet and set the motion vectors to zero, the dyadic decomposition will
be an orthonormal transform. Therefore, we select the same quantizer step-size
for all K intra-frame encoders. The motion information that is required for
the motion-compensated wavelet transform is estimated in each decomposition 
level depending on the results of the lower level. Further, we employ half-pel 
accurate motion compensation with bi-linear interpolation. 
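
The orthonormality claim for the zero-motion case can be checked numerically. The sketch below assumes pictures are flat lists of samples, K is a power of two, and the scaling factors are √2 for the low band and 1/√2 for the high band; the function names are hypothetical.

```python
# Sketch: dyadic temporal Haar decomposition of a GOP with zero motion vectors.
# With the assumed scaling, the transform preserves signal energy.
import math


def haar_lift(even, odd):
    """One lifted Haar stage: predict, update, then normalize."""
    high = [o - e for e, o in zip(even, odd)]        # prediction step
    low = [e + 0.5 * h for e, h in zip(even, high)]  # update step
    scale = math.sqrt(2.0)
    return [scale * v for v in low], [v / scale for v in high]


def dyadic_haar(pictures):
    """Decompose K pictures (K a power of two) into K subband pictures.

    Returns the final low band first, then high bands from coarse to fine.
    """
    lows = list(pictures)
    highs_per_level = []
    while len(lows) > 1:
        next_lows, highs = [], []
        for i in range(0, len(lows), 2):
            low, high = haar_lift(lows[i], lows[i + 1])
            next_lows.append(low)
            highs.append(high)
        highs_per_level.append(highs)
        lows = next_lows
    output = list(lows)
    for highs in reversed(highs_per_level):
        output.extend(highs)
    return output


def energy(pictures):
    """Sum of squared samples over all pictures."""
    return sum(v * v for picture in pictures for v in picture)


if __name__ == "__main__":
    gop = [[float((3 * k + 7 * i) % 11 - 5) for i in range(4)] for k in range(8)]
    subbands = dyadic_haar(gop)
    print(len(subbands), energy(gop), energy(subbands))
```

Since energy is preserved, quantization error of a given size in any subband contributes equally to the reconstruction error, which is consistent with using the same step-size for all K intra-frame encoders.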

6.2.1 Motion-Compensated Lifted Haar Wavelet 

First, we discuss the lifting scheme with motion compensation for the Haar 
wavelet [20]. Fig. 6.1 depicts a Haar transform with motion-compensated lifting
steps. The even frames of the video sequence $s_{2k}$ are used to predict the
odd frames $s_{2k+1}$ (prediction step P). The prediction step is followed by an
update step U.

If the motion field between the even and odd frames is invertible, the
corresponding motion vectors in the update and prediction steps sum to zero. We use
a block-size of 16 x 16 and half-pel accurate motion compensation in the
prediction step and select the motion vectors such that they minimize the squared
error in the high-band $h_k$. In general, this block-motion field is not invertible
but we still utilize the negative motion vectors for the update step as an
approximation. Additional scaling factors in low- and high-band are necessary to
normalize the transform.

Figure 6.1. Haar transform with motion-compensated lifting steps. Both steps, prediction P
and update U, utilize block-based motion compensation.

6.2.2 Motion-Compensated Lifted 5/3 Wavelet 

The Haar wavelet is a short filter and provides limited coding gain. We 
expect better coding efficiency with longer wavelet kernels. In the following, 
we discuss the lifted 5/3 wavelet kernel with motion compensation [20]. 

Figure 6.2. Lifted 5/3 wavelet with motion compensation.

Fig. 6.2 depicts the 5/3 transform with motion-compensated lifting steps.
Similar to the Haar transform, the update steps $U_{2k+1}$ use the negative motion
vectors of the corresponding prediction steps. But for this transform, the odd
frames are predicted by a linear combination of two displaced neighboring
even frames. Again, we use a block-size of 16 x 16 and half-pel accurate
motion compensation in the prediction steps and choose the motion vectors for
$P_{2k+1}$ and $P_{2k+2}$ such that they minimize the squared error in the high-band $h_k$.
The corresponding update steps $U_{2k+1}$ also use the negative motion vectors of
the corresponding prediction steps.

6.2.3 Experimental Results 

For the experiments, we subdivide the QCIF sequences Mother & Daugh- 
ter, Container Ship, Salesman, Mobile & Calendar, Foreman, News, and Car 
Phone, each with 288 frames at 30 fps, into groups of K pictures. We decom- 
pose the GOPs independently and in the case of the 5/3 wavelet, we refer back 
to the first picture in the GOP when the GOP terminates. We will justify later
why we choose this cyclic extension to handle the GOP boundaries.

Figs. 6.3, 6.4, 6.5, 6.6, 6.7, 6.8, and 6.9 show luminance PSNR over the
total bit-rate for the test sequences encoded for groups of K = 2, 8, 16, and 32
pictures with the Haar kernel and for groups of K = 32 with the 5/3 kernel.




Figure 6.3. Luminance PSNR vs. total bit-rate for the QCIF sequence Mother & Daughter at 
30 fps. A dyadic decomposition is used to encode groups of K = 2, 8, 16, and 32 pictures with 
the Haar kernel, and K = 32 with the 5/3 kernel. Results for a basic predictive video codec
with 287 inter-frames are given for reference. 

We observe that the bit-rate savings with the Haar kernel diminish very
quickly as the GOP size approaches 32 pictures. Note also that the 5/3
decomposition with a GOP size of 32 outperforms the Haar decomposition with
a GOP size of 32. For the sequences Mother & Daughter, Container Ship,
and Salesman, the Haar wavelet coding scheme with K = 32 performs
similarly to a comparable basic predictive video codec (intra- and inter-frames,
16 x 16 block-motion compensation, half-pel accuracy, previous reference
picture only, and 8 x 8 DCT) with a very large GOP size. Please note that
for Mobile & Calendar at lower bit-rates the Haar wavelet coding scheme
outperforms the predictive video codec. The 5/3 decomposition with a GOP size
of 32 outperforms not only the corresponding Haar decomposition but also the
basic predictive video coding scheme with a GOP size of K = 288. For the
sequences Foreman, News, and Car Phone, the 5/3 wavelet coding scheme
performs comparably to or slightly better than the predictive video codec. These
sequences contain inhomogeneous motion and we suspect that the use of negative
motion vectors in the update step permits only an insufficient approximation.

Figure 6.5. Luminance PSNR vs. total bit-rate for the QCIF sequence Salesman at 30 fps. A
dyadic decomposition is used to encode groups of K = 2, 8, 16, and 32 pictures with the Haar
kernel, and K = 32 with the 5/3 kernel. Results for a basic predictive video codec with 287
inter-frames are given for reference.

Figure 6.6. Luminance PSNR vs. total bit-rate for the QCIF sequence Mobile & Calendar at
30 fps. A dyadic decomposition is used to encode groups of K = 2, 8, 16, and 32 pictures with
the Haar kernel, and K = 32 with the 5/3 kernel. Results for a basic predictive video codec
with 287 inter-frames are given for reference.

Figure 6.7. Luminance PSNR vs. total bit-rate for the QCIF sequence Foreman at 30 fps. A
dyadic decomposition is used to encode groups of K = 2, 8, 16, and 32 pictures with the Haar
kernel, and K = 32 with the 5/3 kernel. Results for a basic predictive video codec with 287
inter-frames are given for reference.

Figure 6.8. Luminance PSNR vs. total bit-rate for the QCIF sequence News at 30 fps. A
dyadic decomposition is used to encode groups of K = 2, 8, 16, and 32 pictures with the Haar
kernel, and K = 32 with the 5/3 kernel. Results for a basic predictive video codec with 287
inter-frames are given for reference.

Figure 6.9. Luminance PSNR vs. total bit-rate for the QCIF sequence Car Phone at 30 fps. A
dyadic decomposition is used to encode groups of K = 2, 8, 16, and 32 pictures with the Haar
kernel, and K = 32 with the 5/3 kernel. Results for a basic predictive video codec with 287
inter-frames are given for reference.

Further, we investigate the behavior of the coding scheme for the case that
the frames are degraded by additive noise. For that, we generate the sequence
Noisy Foreman by repeating the first frame of the sequence Foreman 32 times
and adding statistically independent white Gaussian noise of variance 25. As
we investigate the residual noise only, this sequence contains no motion.
Predictive coding with motion compensation is not capable of predicting the
additive noise in the current frame. In fact, prediction doubles the noise variance
in the residual signal and we expect that predictive coding performs worse.
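
The doubling of the noise variance can be reproduced with a short simulation. The sketch assumes a static scene, prediction from the previous noisy frame without any loop filter or quantization, and a synthetic uniform "scene" in place of the Foreman frame; the function names and sample count are illustrative choices.

```python
# Sketch: frame-difference prediction of a static scene with independent noise.
# The scene cancels in the residual, but the two noise terms add in variance.
import random


def variance(samples):
    """Biased sample variance."""
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)


def residual_noise_variances(num_samples=20000, noise_variance=25.0, seed=0):
    """Return (noise variance in one frame, variance of the prediction residual)."""
    rng = random.Random(seed)
    sigma = noise_variance ** 0.5
    scene = [rng.uniform(0.0, 255.0) for _ in range(num_samples)]
    previous = [s + rng.gauss(0.0, sigma) for s in scene]
    current = [s + rng.gauss(0.0, sigma) for s in scene]
    frame_noise = [c - s for c, s in zip(current, scene)]
    residual = [c - p for c, p in zip(current, previous)]
    return variance(frame_noise), variance(residual)


if __name__ == "__main__":
    print(residual_noise_variances())
```

The residual variance comes out close to twice the per-frame noise variance, since the difference of two independent noise terms has the sum of their variances.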




Figure 6.10. Luminance PSNR vs. total bit-rate for the QCIF sequence Noisy Foreman at 30 
fps. A dyadic decomposition is used to encode groups of K = 32 pictures with the motion- 
compensated Haar wavelet. Results for a basic predictive video codec with 31 inter-frames are 
given for reference. 

Fig. 6.10 shows luminance PSNR over the total bit-rate for the sequence
Noisy Foreman. The coding scheme with the Haar wavelet kernel and a dyadic 
decomposition of 32 pictures is compared to the predictive coding scheme. 
We observe that the wavelet coding scheme outperforms the predictive coding 
scheme by approximately 2 dB. The predictive coding scheme is inferior as the 
statistically independent noise in the current frame cannot be predicted from 
the previous frame. 

Finally, we discuss briefly our GOP boundary handling for the 5/3 wavelet. 
As we encode the GOPs in the sequence independently, we have to solve the 
boundary problem for the 5/3 wavelet. For the discussion, we consider cyclic 
and symmetric extensions at the GOP boundary. Note that a GOP begins with 
the even picture $s_0$ and always terminates with an odd picture. When the GOP
terminates, the cyclic extension refers back to the first picture $s_0$, and the
symmetric extension uses the last even picture in the GOP twice as a reference. In
the case of the cyclic extension, the terminal update step modifies the first
picture $s_0$, and in the case of the symmetric extension, the last even picture in the
GOP is updated twice. We implemented both extensions for an experimental 
comparison of the rate-distortion performance. 
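
The difference between the two extensions amounts to which even picture serves as the second reference of the last odd picture. The sketch below lists the reference pictures of the prediction steps in the first decomposition level only; the function name and the string labels for the extension type are hypothetical.

```python
# Sketch: reference pictures for the 5/3 prediction steps in one GOP
# (first decomposition level), under cyclic or symmetric boundary extension.

def prediction_references(gop_size, extension):
    """Map each odd picture index to its (left, right) even reference indices."""
    if extension not in ("cyclic", "symmetric"):
        raise ValueError("extension must be 'cyclic' or 'symmetric'")
    references = {}
    for odd in range(1, gop_size, 2):
        left, right = odd - 1, odd + 1
        if right >= gop_size:
            # The GOP terminates: no further even picture is available.
            right = 0 if extension == "cyclic" else left
        references[odd] = (left, right)
    return references


if __name__ == "__main__":
    print(prediction_references(8, "cyclic"))
    print(prediction_references(8, "symmetric"))
```

Only the last entry differs between the two variants; the update steps touch the same pictures as the corresponding prediction steps, so the first picture is modified in the cyclic case and the last even picture is updated twice in the symmetric case.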




Figure 6.11. Luminance PSNR vs. total bit-rate for the QCIF sequence Container Ship at 30 
fps. A dyadic decomposition is used to encode groups of K = 32 pictures with the motion- 
compensated 5/3 wavelet. The compression efficiency of cyclic and symmetric extension is 
depicted. 

Fig. 6.11 shows luminance PSNR over the total bit-rate for the sequence 
Container Ship. We observe that both extensions show similar rate-distortion 
performance. We expect better efficiency for the cyclic extension as it is a
multi-frame approach. But it seems that the large temporal distance is disad- 
vantageous. As we observe a small advantage in terms of compression effi- 
ciency, also for other sequences like Foreman, we use the cyclic extension for 
our experiments. 

6.3 Mathematical Model of Motion-Compensated
Subband Coding

The experimental results show that the temporal subband coding scheme 
can provide superior compression efficiency when compared to the predictive 
coding scheme. In the following, we outline a mathematical model to study 
motion-compensated subband coding in more detail. With that, we derive per- 
formance bounds for motion-compensated three-dimensional transform coding 
and compare to bounds known for motion-compensated predictive coding. 

Let $s_k = \{s_k[l],\ l \in \Pi\}$ be scalar random fields over a two-dimensional
orthogonal grid $\Pi$ with horizontal and vertical spacing of 1. The vector
$l = (x, y)^T$ denotes a particular location in the lattice $\Pi$. We interpret $s_k$ as the
$k$-th of $K$ pictures to be encoded. Further, the signal $s_k[l]$ is thought of as
samples of a space-continuous, spatially band-limited signal and we obtain a
displaced version of it as follows: We shift the ideal reconstruction of the
band-limited signal by the continuous-valued displacement vector $d$ and re-sample it
on the original grid. With this signal model, a spatially constant displacement
operation is invertible.
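
A one-dimensional sketch may make the invertibility concrete. As simplifying assumptions, the signal is taken to be periodic with an odd number of samples (so there is no Nyquist bin to treat specially), and the ideal band-limited shift is applied as a linear phase in the DFT domain; all function names are hypothetical.

```python
# Sketch: sub-sample displacement of a periodic band-limited signal via the DFT.
# Shifting by d and then by -d recovers the original samples.
import cmath
import math


def dft(x):
    """Direct O(N^2) discrete Fourier transform."""
    n_samples = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                for n in range(n_samples))
            for k in range(n_samples)]


def idft(spectrum):
    """Direct O(N^2) inverse discrete Fourier transform."""
    n_samples = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * n / n_samples)
                for k in range(n_samples)) / n_samples
            for n in range(n_samples)]


def displace(x, d):
    """Shift x by d samples (d may be fractional) and re-sample on the grid."""
    n_samples = len(x)
    spectrum = dft(x)
    shifted = []
    for k in range(n_samples):
        # Signed frequency index, so that the result stays real-valued.
        f = k if k <= n_samples // 2 else k - n_samples
        shifted.append(spectrum[k] * cmath.exp(-2j * math.pi * f * d / n_samples))
    return [value.real for value in idft(shifted)]


if __name__ == "__main__":
    x = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0, 6.0, 9.0]
    back = displace(displace(x, 0.37), -0.37)
    print(max(abs(a - b) for a, b in zip(x, back)))
```

In the frequency domain the displacement is a multiplication by a unit-magnitude phase term, so the inverse displacement simply multiplies by the conjugate phase.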

6.3.1 Motion-Compensated Lifted Haar Wavelet 

With the above signal model, we revisit the motion-compensated lifted Haar 
wavelet in Fig. 6.1 and remove the displacement operators in the lifting steps
such that we can isolate a lifted Haar wavelet without displacement operators. 

Fig. 6.12 shows the equivalent Haar wavelet where the displacement oper- 
ators are pre- and post-processing operators with respect to the original Haar 
transform. The schemes in Fig. 6.12 are equivalent, if the displacement opera- 
tors are linear and invertible. 

We continue and perform the dyadic decomposition of a GOP with the 
equivalent Haar wavelet. For that, the displacements of the equivalent Haar 
blocks have to be added. We assume that the estimated displacements between 
pairs of frames are additive such that, e.g., $\hat{d}_{02} + \hat{d}_{23} = \hat{d}_{03}$. As the true
displacements are also additive, e.g. $d_{02} + d_{23} = d_{03}$, and differ from the estimated
displacement by the displacement error, i.e. $d_{ij} = \hat{d}_{ij} + \Delta_{ij}$, we conclude that
the displacement errors are also additive, e.g. $\Delta_{02} + \Delta_{23} = \Delta_{03}$ [227].
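
The conclusion follows in one line from the two additivity assumptions. Written out, with a hat marking the estimated displacement between pictures i and j (the hat is a notational assumption here), the true displacement written without it, and the displacement error defined as their difference:

```latex
\begin{align*}
\Delta_{02} + \Delta_{23}
  &= (d_{02} - \hat{d}_{02}) + (d_{23} - \hat{d}_{23}) \\
  &= (d_{02} + d_{23}) - (\hat{d}_{02} + \hat{d}_{23}) \\
  &= d_{03} - \hat{d}_{03} \;=\; \Delta_{03}
\end{align*}
```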

Figure 6.12. Haar transform with lifting steps that shift the signal (top). As the shift operation
is invertible, an equivalent system without shifts in the lifting steps is possible (bottom).




Figure 6.13. Dyadic Haar Transform (DHT) without shifts in the lifting steps for K = 4
pictures.



Fig. 6.13 depicts a dyadic decomposition for K = 4 pictures based on the
equivalent Haar wavelet in Fig. 6.12. The dyadic Haar transform without
displacements in the lifting steps is labeled by DHT. The displacements $\hat{d}_{0k}$ are
pre- and post-processing operators with respect to the original dyadic Haar
decomposition DHT.

6.3.2 Motion-Compensated Lifted 5/3 Wavelet 

We also apply the invertible displacement operator to the motion-compen- 
sated lifted 5/3 wavelet in Fig. 6.2 and obtain the equivalent 5/3 wavelet in 
Fig. 6.14. 




Figure 6.14. 5/3 wavelet with lifting steps that shift the signal (top). As the shift operation is
invertible, an equivalent system without shifts in the lifting steps is possible (bottom).

Due to the structure of the 5/3 wavelet, we have displacements between
the frames 2k & 2k + 1, 2k + 2 & 2k + 1, and 2k & 2k + 2 (in the next
decomposition level). Again, we assume that the estimated displacements are
additive such that, e.g., $\hat{d}_{01} - \hat{d}_{21} = \hat{d}_{02}$. With this assumption, the displacement
operators between the levels cancel out and several decomposition levels are
possible without displacements between the levels.

The equivalent dyadic 5/3 transform has the same pre- and post-processing 
displacement operators as the equivalent dyadic Haar transform in Fig. 6.13 
but the DHT is replaced by the original dyadic 5/3 decomposition as depicted 
in Fig. 6.15. 




Figure 6.15. Dyadic 5/3 Transform (D5/3T) without shifts in the lifting steps for K = 4
pictures.



6.3.3 Signal Model 

Now, we assume that the pictures $s_k$ are shifted versions of a “clean” video
signal $v$ with the true displacements $d_{0k}$ and distorted by independent additive
white Gaussian noise $n_k$. Combining this signal model with the equivalent
dyadic decomposition, we can eliminate the absolute displacements and
restrict ourselves to the displacement error $\Delta_{0k}$ in the $k$-th picture. In the
following, we do not consider particular displacement errors $\Delta_{0k}$. We rather specify
statistical properties and consider them as random variables $\Delta_k$, statistically
independent from the “clean” signal $v$ and the noise $n_k$. The noise signals
$n_\mu$ and $n_\nu$ are also mutually statistically independent.




Figure 6.16. Motion compensation for a group of K pictures.



Fig. 6.16 depicts the generalized model with the displacement-free and
linear transform T for a group of K pictures. The motion-compensated pictures
$c_1, \ldots, c_{K-1}$ are aligned with respect to the first picture $c_0$. According to
Fig. 6.13, the signals $z_k$ are independently intra-frame encoded. As the
absolute displacements have no influence on the performance of the intra-frame
encoder, we omit them and consider only the direct output signals $y_k$ of T.



Now, assume that the random fields $v$ and $c_k$ are jointly wide-sense
stationary with the real-valued scalar two-dimensional power spectral densities
$\Phi_{vv}(\omega)$ and $\Phi_{c_\mu c_\nu}(\omega)$. The power spectral densities $\Phi_{c_\mu c_\nu}(\omega)$ are elements
in the power spectral density matrix of the motion-compensated pictures $\Phi_{cc}$.
The power spectral density matrix of the decorrelated signal $\Phi_{yy}$ is given by
$\Phi_{cc}$ and the transform T,

$$\Phi_{yy}(\omega) = T(\omega)\,\Phi_{cc}(\omega)\,T^H(\omega) \tag{6.1}$$

where $T^H$ denotes the Hermitian conjugate of T and $\omega = (\omega_x, \omega_y)^T$ the
vector-valued frequency.

We adopt the expressions for the cross spectral densities $\Phi_{c_\mu c_\nu}$ from [11]

$$\Phi_{c_\mu c_\nu}(\omega) = E\left\{ e^{-j\omega^T(\Delta_\mu - \Delta_\nu)} \right\} \Phi_{vv}(\omega) + \Phi_{n_\mu n_\nu}(\omega) \tag{6.2}$$

and assume a normalized power spectrum $\Phi_{vv}$ with $\sigma_v^2 = 1$ that corresponds to
an exponentially decaying isotropic autocorrelation function with a correlation
coefficient between horizontally and vertically adjacent pixels of $\rho_v = 0.93$.

For the $k$-th displacement error $\Delta_k$, a 2-D normal distribution with variance
$\sigma_\Delta^2$ and zero mean is assumed where the x- and y-components are statistically
independent. The expected value in (6.2) depends on the variance of the
displacement error with respect to the reference picture $c_0$ (absolute displacement
accuracy) and the variance of the difference displacement error between pairs
of non-reference pictures (relative displacement accuracy). We assume that
each picture in a GOP can be the reference picture $c_0$. That is, there is no
preference among the pictures in a GOP and the variances of the absolute
displacement error are the same for all K - 1 motion-compensated pictures. Based
on the dyadic decomposition with motion-compensated lifted wavelets and the
assumption that there is no preference among the pictures in a GOP, we assume
that absolute and relative displacement accuracy are identical. The differences
of absolute displacement errors are related to the relative displacement errors
as we assume additive estimated displacements in Sections 6.3.1 and 6.3.2.

$$\Delta_{0j} - \Delta_{0i} = \Delta_{ij} \tag{6.3}$$

With that, we obtain for the variances of the absolute and relative displacement
error components:

$$E\{(\Delta_{0j} - \Delta_{0i})^2\} = E\{\Delta_{ij}^2\} \tag{6.4}$$

$$2\sigma_\Delta^2 (1 - \rho_\Delta) = \sigma_\Delta^2 \tag{6.5}$$

This is only possible with correlated displacement errors such that $\rho_\Delta = 0.5$
[228]. Finally, we abbreviate the expected value in (6.2) with $P(\omega, \sigma_\Delta^2)$ which
is the characteristic function of the continuous 2-D Gaussian displacement
error.

$$E\left\{ e^{-j\omega^T \Delta_k} \right\} := P(\omega, \sigma_\Delta^2) = e^{-\frac{1}{2}\omega^T\omega\,\sigma_\Delta^2} \tag{6.6}$$
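
With that, we obtain for the power spectral density matrix of the
motion-compensated pictures

$$\frac{\Phi_{cc}(\omega)}{\Phi_{vv}(\omega)} = \begin{pmatrix} 1+\alpha(\omega) & P(\omega) & \cdots & P(\omega) \\ P(\omega) & 1+\alpha(\omega) & \cdots & P(\omega) \\ \vdots & \vdots & \ddots & \vdots \\ P(\omega) & P(\omega) & \cdots & 1+\alpha(\omega) \end{pmatrix}. \tag{6.7}$$

$\alpha = \alpha(\omega)$ is the normalized spectral density of the noise $\Phi_{n_k n_k}(\omega)$ with respect
to the spectral density of the “clean” video signal.

$$\alpha(\omega) = \frac{\Phi_{n_k n_k}(\omega)}{\Phi_{vv}(\omega)} \quad \text{for } k = 0, 1, \ldots, K-1 \tag{6.8}$$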

T represents the dyadic Haar transform or the dyadic 5/3 transform. In 
terms of decorrelation and coding gain, the 5/3 wavelet performs better than 
the Haar wavelet as shown in Figs. 6.3 - 6.9. In the following, we are interested 
in theoretical performance bounds and choose the Karhunen-Loeve Transform 
(KLT). The normalized eigenvalues of the power spectral density matrix
$\Phi_{cc}$ are $\lambda_1(\omega) = 1 + \alpha(\omega) + (K-1)P(\omega)$ and
$\lambda_{2,3,\ldots,K}(\omega) = 1 + \alpha(\omega) - P(\omega)$.
The power spectral density matrix of the transformed signals $\Phi_{yy}$ is
diagonal.

$$\Phi_{yy}(\omega) = \Phi_{vv}(\omega) \begin{pmatrix} \lambda_1(\omega) & 0 & \cdots & 0 \\ 0 & \lambda_2(\omega) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_K(\omega) \end{pmatrix} \qquad (6.9)$$

The first eigenvector just adds all components and scales with $1/\sqrt{K}$.
For the remaining eigenvectors, any orthonormal basis can be used that is
orthogonal to the first eigenvector. That is, the KLT for our signal model does
not depend on $\omega$. Note that for this simple signal model, the Haar
transform is also a KLT.
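The statement that the dyadic Haar transform is a KLT for this signal model can be checked numerically. The following is a minimal sketch, assuming K = 4 pictures and arbitrary illustrative values for $\alpha$ and $P$ at a single frequency; the function names are hypothetical and not taken from the original work.

```python
import math

def psd_matrix(K, alpha, P):
    """Normalized PSD matrix: 1 + alpha on the diagonal, P elsewhere."""
    return [[1.0 + alpha if i == j else P for j in range(K)] for i in range(K)]

def haar4():
    """Orthonormal dyadic Haar analysis matrix for K = 4 pictures."""
    h, s = 0.5, 1.0 / math.sqrt(2.0)
    return [[h, h, h, h],        # first basis vector: sum of all pictures / sqrt(K)
            [h, h, -h, -h],      # remaining rows are orthogonal to the first one
            [s, -s, 0.0, 0.0],
            [0.0, 0.0, s, -s]]

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def transformed_psd(alpha, P):
    """T * M * T^T for the Haar matrix T and the normalized PSD matrix M."""
    T = haar4()
    return matmul(matmul(T, psd_matrix(4, alpha, P)), transpose(T))

if __name__ == "__main__":
    for row in transformed_psd(alpha=0.1, P=0.8):
        print([round(x, 6) for x in row])
```

Under these assumptions the result is diagonal, with $1 + \alpha + (K-1)P$ in the first position and $1 + \alpha - P$ in the remaining ones, which agrees with the eigenvalues stated in the text.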






6.3.4 Transform Coding Gain 

The rate difference [11] is used to measure the improved compression effi- 
ciency for each picture k. 



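$$\Delta R_k = \frac{1}{4\pi^2} \int_{-\pi}^{\pi}\!\int_{-\pi}^{\pi} \frac{1}{2} \log_2\left(\frac{\Phi_{y_k y_k}(\omega)}{\Phi_{c_k c_k}(\omega)}\right) d\omega \qquad (6.10)$$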



It represents the maximum bit-rate reduction (in bit per sample) possible by
optimum encoding of the transformed signal $y_k$, compared to optimum
intra-frame encoding of the signal $c_k$ for Gaussian wide-sense stationary
signals for the same mean square reconstruction error. A negative $\Delta R_k$
corresponds to a reduced bit-rate compared to optimum intra-frame coding. The
overall rate difference $\Delta R$ is the average over all pictures and is used
to evaluate the efficiency of motion-compensated transform coding. Assuming the
KLT, we obtain for the overall rate difference



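$$\Delta R = \frac{1}{4\pi^2} \int_{-\pi}^{\pi}\!\int_{-\pi}^{\pi} \left[ \frac{K-1}{2K} \log_2\left(1 - \frac{P(\omega, \sigma_\Delta^2)}{1 + \alpha(\omega)}\right) + \frac{1}{2K} \log_2\left(1 + (K-1)\frac{P(\omega, \sigma_\Delta^2)}{1 + \alpha(\omega)}\right) \right] d\omega. \qquad (6.11)$$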



The case of a very large number of motion-compensated pictures is of spe- 
cial interest for the comparison to predictive video coding with motion com- 
pensation. 



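$$\Delta R_{K \to \infty} = \frac{1}{4\pi^2} \int_{-\pi}^{\pi}\!\int_{-\pi}^{\pi} \frac{1}{2} \log_2\left(1 - \frac{P(\omega, \sigma_\Delta^2)}{1 + \alpha(\omega)}\right) d\omega \qquad (6.12)$$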
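The limit (6.12) follows from (6.11) term by term. As a sketch, write $x(\omega) = P(\omega, \sigma_\Delta^2) / (1 + \alpha(\omega))$, a shorthand introduced here only for this illustration, with $0 < x < 1$ whenever $\alpha > 0$:

```latex
\[
\begin{aligned}
\lim_{K\to\infty} \frac{K-1}{2K}\,\log_2\bigl(1 - x\bigr)
   &= \frac{1}{2}\,\log_2\bigl(1 - x\bigr), \\
\lim_{K\to\infty} \frac{1}{2K}\,\log_2\bigl(1 + (K-1)\,x\bigr)
   &= 0 .
\end{aligned}
\]
```

The second term vanishes because the logarithm grows only like $\log_2 K$ while its prefactor decays like $1/K$, so only the first term survives in the integrand.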



Note that the performance of predictive coding with motion compensation and 
optimum Wiener filter achieves a rate difference of 



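$$\Delta R_{\mathrm{MCP}} = \frac{1}{4\pi^2} \int_{-\pi}^{\pi}\!\int_{-\pi}^{\pi} \frac{1}{2} \log_2\left(1 - \frac{P^2(\omega, \sigma_\Delta^2)}{[1 + \alpha(\omega)]^2}\right) d\omega. \qquad (6.13)$$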



We obtain this result from [11], Eqn. 21 with $N = 1$ and $\alpha_0 = \alpha_1 = \alpha$.
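The rate differences for transform coding with K pictures and for predictive coding can be evaluated numerically. The sketch below is illustrative only and rests on assumptions that are not part of the original analysis: the normalized noise spectrum $\alpha(\omega)$ is taken as a constant `alpha > 0`, the characteristic function is the isotropic Gaussian form $e^{-\frac{1}{2}\omega^T\omega\,\sigma_\Delta^2}$, and the integral is approximated with a simple midpoint rule. All function names are hypothetical, and the numbers are not expected to reproduce Figs. 6.17 and 6.18, which depend on the actual video signal spectrum.

```python
import math

def char_fn(wx, wy, var_delta):
    """Characteristic function of an isotropic 2-D Gaussian displacement error."""
    return math.exp(-0.5 * var_delta * (wx * wx + wy * wy))

def band_average(f, n=64):
    """Midpoint rule for (1 / 4 pi^2) times the integral of f over [-pi, pi]^2."""
    h = 2.0 * math.pi / n
    pts = [-math.pi + (i + 0.5) * h for i in range(n)]
    return sum(f(wx, wy) for wx in pts for wy in pts) * h * h / (4.0 * math.pi ** 2)

def var_from_beta(beta):
    """Invert beta = log2(sqrt(12) * sigma_Delta) to obtain sigma_Delta^2."""
    return (2.0 ** beta) ** 2 / 12.0

def rate_diff_transform(K, beta, alpha):
    """Overall rate difference of the K-picture KLT in bits per sample.

    ASSUMPTION: alpha is constant over frequency and strictly positive.
    """
    var_delta = var_from_beta(beta)
    def integrand(wx, wy):
        x = char_fn(wx, wy, var_delta) / (1.0 + alpha)
        return ((K - 1) * math.log2(1.0 - x) + math.log2(1.0 + (K - 1) * x)) / (2.0 * K)
    return band_average(integrand)

def rate_diff_mcp(beta, alpha):
    """Rate difference of predictive coding with Wiener filter (same assumption)."""
    var_delta = var_from_beta(beta)
    def integrand(wx, wy):
        x = char_fn(wx, wy, var_delta) / (1.0 + alpha)
        return 0.5 * math.log2(1.0 - x * x)
    return band_average(integrand)

if __name__ == "__main__":
    alpha = 1e-3  # flat noise-to-signal ratio, an illustrative choice
    for K in (2, 8, 32):
        print("K =", K, round(rate_diff_transform(K, -1.0, alpha), 3))
    print("MCP", round(rate_diff_mcp(-1.0, alpha), 3))
```

Under these assumptions the rate difference decreases monotonically with K, and the gap between a very large K and predictive coding stays below 0.5 bits per sample, since the two integrands differ by $\frac{1}{2}\log_2(1 + x)$ with $x < 1$.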

Figs. 6.17 and 6.18 depict the rate difference according to (6.11) and (6.13)
over the displacement inaccuracy $\beta = \log_2(\sqrt{12}\,\sigma_\Delta)$ for a
residual noise level RNL $= 10 \log_{10}(\sigma_n^2)$ of -100 dB and -30 dB,
respectively. Note that the variance of the “clean” video signal v is
normalized to $\sigma_v^2 = 1$. We observe that the rate difference starts to
saturate for K = 32. This observation is consistent with the experimental
results in the previous section. For a very large group of pictures and
negligible residual noise, the slope of the rate difference is limited by 1 bit
per sample per inaccuracy step, similar to that of predictive coding with
motion compensation. Further, transform coding with motion compensation
outperforms predictive coding with motion compensation by at most 0.5 bits per
sample. For example, if we encode frames with statistically independent
additive noise, predictive coding with motion compensation is not capable of
predicting the additive noise in the current frame. In this case, prediction
actually doubles the noise variance in the residual signal and predictive
coding performs suboptimally. The advantage of the motion-compensated lifted
5/3 wavelet over motion-compensated predictive coding is also observed in the
experimental results.

Figure 6.17. Rate difference for motion-compensated transform coding with
groups of K pictures over the displacement inaccuracy $\beta$. The performance
of predictive coding with motion compensation and Wiener filter is labeled by
MCP. The residual noise level is -100 dB.

Figure 6.18. Rate difference for motion-compensated transform coding with
groups of K pictures over the displacement inaccuracy $\beta$. The performance
of predictive coding with motion compensation and Wiener filter is labeled by
MCP. The residual noise level is -30 dB.

6.4 Conclusions 

This chapter discusses motion compensation for groups of K pictures. 
We investigate experimentally and theoretically motion-compensated lifted 
wavelet transforms. The experiments show that the 5/3 wavelet kernel out- 
performs both the Haar kernel and, in many cases, the reference scheme uti- 
lizing motion-compensated predictive coding. The motion-compensated lifted 
wavelet kernels re-use the motion vectors in the prediction step for the update 
step by assuming an invertible block-motion field. This assumption seems to be 
inadequate for sequences with inhomogeneous motion as their rate-distortion 
performance is weaker than expected. 

The theoretical discussion is based on a signal model for K motion- 
compensated pictures that are decorrelated by a linear transform. The dyadic 
decomposition of K pictures with motion-compensated lifted wavelets is re- 
placed by an equivalent coding scheme with K motion-compensated pictures 
and a dyadic wavelet decomposition without motion compensation. That is, 
we remove the displacement operators in the lifting steps and generate a set 
of motion-compensated pictures with an additional constraint on the
displacement errors. We generalize the model and employ the Karhunen-Loeve
Transform to obtain theoretical performance bounds at high bit-rates for
motion-compensated 3D transform coding.

The analysis of this model gives the following insights: The coding gain 
for a group of K pictures is limited and saturates with increasing K. For a 
very large group of pictures and negligible residual noise, the slope of the rate 
difference is limited by 1 bit per sample per inaccuracy step. The slope of the
rate difference for predictive coding with motion compensation is also limited
by 1 bit per sample per inaccuracy step, but motion-compensated transform
coding outperforms predictive coding with motion compensation by up to 0.5 bits
per sample. This
is also true for very accurate motion compensation when the residual noise 
dominates the coding gain. 




Chapter 7 



SUMMARY 



This work discusses video coding with superimposed motion-compensated 
signals. We build on the theory of multihypothesis motion-compensated pre- 
diction for video coding and introduce the concept of motion compensation 
with complementary hypotheses. Multihypothesis motion compensation lin- 
early combines more than one motion-compensated signal to form the superim- 
posed motion-compensated signal. Motion-compensated signals that are used 
for the superposition are referred to as hypotheses. Further, a displacement
error that captures the inaccuracy of motion compensation is associated with
each hypothesis. As the accuracy of motion compensation is equal for all
hypotheses, the displacement errors are identically distributed.

This work proposes that the multiple displacement errors are jointly
distributed and, in particular, correlated. Investigations show that there is no
preference among N hypotheses and we conclude that all non-equal pairs of
displacement errors are characterized by one correlation coefficient. As the
covariance matrix of the jointly distributed displacement errors is nonnegative
definite, we can determine the valid range of the displacement error correlation
coefficient dependent on the number of superimposed hypotheses.

We investigate the efficiency of superimposed motion compensation as a
function of the displacement error correlation coefficient. We observe that
decreasing the displacement error correlation coefficient improves the
efficiency of superimposed motion compensation. We conclude that motion
compensation with complementary hypotheses results in maximally negatively
correlated displacement error.

Motion compensation with complementary hypotheses implies two major 
results for the efficiency of superimposed motion-compensated prediction: 
First, the slope of the rate difference reaches up to 2 bits per sample per motion
inaccuracy step whereas for single hypothesis motion-compensated prediction
this slope is limited to 1 bit per sample per motion inaccuracy step. Here, we
measure the rate difference with respect to optimum intra-frame encoding and
use a high-rate approximation. Second, this slope of 2 bits per sample per
inaccuracy step is already achieved for N = 2 complementary hypotheses. If
we just average the hypotheses, the performance converges at constant slope
to that of a very large number of hypotheses. That is, the largest portion of the
achievable gain is already accomplished with N = 2 complementary hypotheses.
If we employ the optimum Wiener filter, the coding performance improves
for doubling the number of complementary hypotheses by at least 0.5 bits per
sample at constant slope.

In this work, we investigate motion compensation with complementary hy- 
potheses by integrating superimposed motion-compensated prediction into the 
ITU-T Rec. H.263. We linearly combine up to 4 motion-compensated blocks 
chosen from up to 20 previous reference frames to improve the performance 
of inter-predicted pictures. To determine the best N-hypothesis for each
predicted block, we utilize an iterative algorithm that successively improves
conditionally optimal hypotheses. Our experiments show that superimposed
prediction works efficiently for both 16x16 and 8x8 blocks. Multiple reference
frames enhance the efficiency of superimposed prediction. The superposition
gain and the multiframe gain do not only add up; superimposed prediction
benefits from hypotheses which can be chosen from several reference frames.
Superimposed prediction with two hypotheses and ten reference frames achieves
coding gains up to 2.7 dB, or equivalently, bit-rate savings up to 30% for the
sequence Mobile & Calendar when compared to the one-hypothesis reference
codec with one reference frame.

We explore theoretically why superimposed prediction benefits from mul- 
tiframe motion compensation. We model multiframe motion compensation 
by forward-adaptive hypothesis switching and show that switching among M 
hypotheses with statistically independent displacement error reduces the
displacement error variance by up to a factor of M. For motion-compensated
prediction, we obtain the following performance bounds: Doubling the number
of reference pictures for single hypothesis prediction reduces the bit-rate
of the residual encoder by at most 0.5 bits per sample, whereas doubling the
number of reference pictures for prediction with complementary hypotheses
reduces the bit-rate of the residual encoder by at most 1 bit per sample.

This work also investigates motion compensation with complementary
hypotheses for B-pictures in the emerging ITU-T Rec. H.264. We focus on
reference picture selection and linearly combined motion-compensated prediction
signals. We show that bidirectional prediction partially exploits the efficiency
of combined prediction signals. Superimposed prediction chooses hypotheses
from an arbitrary set of reference pictures and, by this, outperforms
bidirectional prediction. That is, superimposed motion-compensated prediction with
multiple reference frames allows a more general form of B-pictures. In addi- 
tion to the generalization of the bidirectional mode, we allow that previously 
decoded B-pictures can be reference pictures for other B-pictures. Again, we 
observe that multiframe motion compensation enhances the efficiency of su- 
perimposed prediction for hybrid video coding. 

Finally, we discuss superimposed motion-compensated signals for motion- 
compensated 3D subband coding of video. We investigate experimentally and 
theoretically motion-compensated lifted wavelet transforms for the temporal 
subband decomposition. The experiments show that the 5/3 wavelet kernel out- 
performs both the Haar kernel and, in many cases, the reference scheme utiliz- 
ing motion-compensated predictive coding. Based on the motion-compensated 
lifting scheme, we develop an analytical model describing motion compensa- 
tion for groups of K pictures. 

The theoretical discussion is based on a signal model for K motion- 
compensated pictures that are decorrelated by a linear transform. The dyadic 
decomposition of K pictures with motion-compensated lifted wavelets is re- 
placed by an equivalent coding scheme with K motion-compensated pictures 
and a dyadic wavelet decomposition without motion compensation. That is, 
we remove the displacement operators in the lifting steps and generate a set 
of motion-compensated pictures with an additional constraint on the displace- 
ment errors. We generalize the model and employ the Karhunen-Loeve Trans- 
form to obtain theoretical performance bounds at high bit-rates for motion- 
compensated 3D transform coding. 

The analysis of this model gives the following insights: The coding gain for 
a group of K pictures is limited and starts to saturate for K = 32 pictures. 
For a very large group of pictures and negligible residual noise, the slope of 
the rate difference is limited by 1 bit per sample per inaccuracy step. The 
slope of the rate difference for motion-compensated prediction is also limited
by 1 bit per sample per inaccuracy step, but motion-compensated transform coding
outperforms motion-compensated prediction by at most 0.5 bits per sample. This is also
true for very accurate motion compensation when the residual noise dominates 
the coding gain. 

In summary, we characterize the relation between motion-compensated sig- 
nals and, depending on this relation, investigate their efficiency for video com- 
pression. 




Appendix A 
Mathematical Results 



A.1 Singularities of the Displacement Error Covariance Matrix



The prediction scheme shows no preferences among the individual hypotheses. This is
reflected in the symmetry of the displacement error covariance matrix. The variance of the
displacement error for each hypothesis is $\sigma_\Delta^2$ and the correlation coefficient
between any hypothesis pair is $\rho_\Delta$. With that, the covariance matrix of the
displacement error vector reads



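$$C = \sigma_\Delta^2 \begin{pmatrix} 1 & \rho_\Delta & \cdots & \rho_\Delta \\ \rho_\Delta & 1 & \cdots & \rho_\Delta \\ \vdots & \vdots & \ddots & \vdots \\ \rho_\Delta & \rho_\Delta & \cdots & 1 \end{pmatrix}. \qquad (A.1)$$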



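In order to determine the singularities of the matrix, we decompose the covariance matrix into
the identity matrix $I$ and the matrix $\mathbf{1}\mathbf{1}^T$ with each element equal to one.

$$C = \sigma_\Delta^2 \rho_\Delta \mathbf{1}\mathbf{1}^T - \sigma_\Delta^2 (\rho_\Delta - 1) I \qquad (A.2)$$

The covariance matrix is singular if its determinant is zero

$$\det\left(\mathbf{1}\mathbf{1}^T - \frac{\rho_\Delta - 1}{\rho_\Delta} I\right) = 0. \qquad (A.3)$$

The eigenvalues of the matrix $\mathbf{1}\mathbf{1}^T$ are $\lambda = \{N, 0\}$, which can also
be obtained by solving $\det(\mathbf{1}\mathbf{1}^T - \lambda I) = 0$. We obtain two
singularities for the covariance matrix:

$$\rho_\Delta = \frac{1}{1 - N} \quad \text{and} \quad \rho_\Delta = 1. \qquad (A.4)$$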
A.2 A Class of Matrices and their Eigenvalues

We consider the following class of normalized $N \times N$ matrices with the parameter $a \in \mathbb{R}$:

$$C = \begin{pmatrix} 1 & a & \cdots & a \\ a & 1 & \cdots & a \\ \vdots & \vdots & \ddots & \vdots \\ a & a & \cdots & 1 \end{pmatrix} \qquad (A.5)$$



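In order to determine the eigenvalues of $C$, we decompose the matrix into the identity matrix $I$
and the matrix $\mathbf{1}\mathbf{1}^T$ with each element equal to one.

$$C = a\,\mathbf{1}\mathbf{1}^T + (1 - a) I \qquad (A.6)$$

The eigenvalues $\lambda_i$ of $C$ solve $\det(C - \lambda_i I) = 0$ and with (A.6), we can write for $a \neq 0$

$$\det\left(\mathbf{1}\mathbf{1}^T - \frac{\lambda_i - 1 + a}{a} I\right) = 0. \qquad (A.7)$$

The eigenvalues of the matrix $\mathbf{1}\mathbf{1}^T$ are $N$ (1-fold) and $0$ ($(N-1)$-fold). Both can be
obtained by solving $\det(\mathbf{1}\mathbf{1}^T - \lambda I) = 0$. Equating $(\lambda_i - 1 + a)/a$ with
these values for $a \neq 0$, we have the following eigenvalues for $C$:

$$\lambda_1 = 1 + (N - 1)a \quad \text{1-fold}, \qquad \lambda_2 = 1 - a \quad (N-1)\text{-fold}. \qquad (A.8)$$

If $a = 0$, the $N$-fold eigenvalue of $C$ is 1.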
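The eigenvalues in (A.8) can be verified with exact rational arithmetic by applying the matrix to explicit eigenvectors. The following is a minimal sketch; the helper names are hypothetical, and the choice of the vectors $e_0 - e_k$ as a basis orthogonal to the all-ones vector is just one convenient option.

```python
from fractions import Fraction

def class_matrix(N, a):
    """N x N matrix with ones on the diagonal and a elsewhere."""
    a = Fraction(a)
    return [[Fraction(1) if i == j else a for j in range(N)] for i in range(N)]

def apply(C, x):
    """Matrix-vector product C x."""
    return [sum(c * xi for c, xi in zip(row, x)) for row in C]

def eigenvalues(N, a):
    """Closed-form eigenvalues: (1-fold value, (N-1)-fold value)."""
    a = Fraction(a)
    return 1 + (N - 1) * a, 1 - a

def check_eigenpairs(N, a):
    """Return True if the closed-form eigenvalues match explicit eigenvectors."""
    C = class_matrix(N, a)
    lam1, lam2 = eigenvalues(N, a)
    ones = [Fraction(1)] * N
    ok = apply(C, ones) == [lam1 * v for v in ones]
    for k in range(1, N):
        # e_0 - e_k is orthogonal to the all-ones vector
        x = [Fraction(0)] * N
        x[0], x[k] = Fraction(1), Fraction(-1)
        ok = ok and apply(C, x) == [lam2 * v for v in x]
    return ok

if __name__ == "__main__":
    print(check_eigenpairs(4, Fraction(1, 3)))
    print(eigenvalues(4, Fraction(-1, 3)))  # first eigenvalue vanishes at a = 1/(1-N)
```

The same closed form also recovers the singularities of Section A.1: the 1-fold eigenvalue is zero for $a = 1/(1-N)$ and the $(N-1)$-fold eigenvalue is zero for $a = 1$.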



A.3 Inverse of the Power Spectral Density Matrix 

The optimum Wiener filter requires the inverse of the hypothesis power spectral density
matrix $\Phi_{cc}(\omega)$. After normalization, the matrix $H_1$ of the form

$$H_1 = \begin{pmatrix} c & 1 & \cdots & 1 \\ 1 & c & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & c \end{pmatrix} \qquad (A.9)$$

is inverted. In the following, we prefer a representation of $H_1$ with the vector
$\mathbf{1} = (1, 1, \ldots, 1)^T$ and the identity matrix $I$

$$H_1 = \mathbf{1}\mathbf{1}^T + (c - 1) I. \qquad (A.10)$$

Due to the symmetry of $H_1$, its inverse shows the same structure

$$H_1^{-1} = k \left[ \mathbf{1}\mathbf{1}^T + (d - 1) I \right], \qquad (A.11)$$

where $k$ and $d$ are scalar constants. As the inverse of the non-singular matrix
($c \neq 1$, $c \neq 1 - N$) is unique, we obtain

$$k = \frac{1}{(c - 1)(1 - c - N)}, \qquad (A.12)$$

$$d = 2 - c - N, \qquad (A.13)$$
and for the inverse

$$H_1^{-1} = \frac{\mathbf{1}\mathbf{1}^T + (1 - c - N) I}{(c - 1)(1 - c - N)}. \qquad (A.14)$$
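The closed-form inverse of $H_1$, namely $[\mathbf{1}\mathbf{1}^T + (1 - c - N) I] / [(c - 1)(1 - c - N)]$, can be checked by multiplying it with $H_1$ in exact rational arithmetic. The sketch below is only an illustration with hypothetical helper names; it raises an error in the singular cases $c = 1$ and $c = 1 - N$.

```python
from fractions import Fraction

def h1(N, c):
    """Normalized matrix with c on the diagonal and ones elsewhere."""
    c = Fraction(c)
    return [[c if i == j else Fraction(1) for j in range(N)] for i in range(N)]

def h1_inverse(N, c):
    """Closed-form inverse; valid only for c != 1 and c != 1 - N."""
    c = Fraction(c)
    denom = (c - 1) * (1 - c - N)
    if denom == 0:
        raise ValueError("H1 is singular for c = 1 or c = 1 - N")
    diag = (1 + (1 - c - N)) / denom  # all-ones entry plus the identity term
    off = Fraction(1) / denom         # all-ones entry only
    return [[diag if i == j else off for j in range(N)] for i in range(N)]

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def is_identity(M):
    n = len(M)
    return all(M[i][j] == (1 if i == j else 0) for i in range(n) for j in range(n))

if __name__ == "__main__":
    N, c = 3, Fraction(5, 2)
    print(is_identity(matmul(h1(N, c), h1_inverse(N, c))))
```

Using `Fraction` avoids rounding effects, so the identity check is exact rather than approximate.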



In the case that the noise variance for each hypothesis is not identical, a normalized matrix

$$H_2 = \begin{pmatrix} c_1 & a_1 a_2 & \cdots & a_1 a_N \\ a_2 a_1 & c_2 & \cdots & a_2 a_N \\ \vdots & \vdots & \ddots & \vdots \\ a_N a_1 & a_N a_2 & \cdots & c_N \end{pmatrix} \qquad (A.15)$$

is inverted. A representation of $H_2$ with the vector $a = (a_1, a_2, \ldots, a_N)^T$ and the
diagonal matrix diag(·) reads

$$H_2 = a a^T + \operatorname{diag}(c_i - a_i^2). \qquad (A.16)$$

The inverse shows the same symmetry as $H_2$:

$$H_2^{-1} = k \left[ b b^T + \operatorname{diag}(d_i - b_i^2) \right], \qquad (A.17)$$



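where $k$ is a scalar constant and $b$ and $d$ are vectors of size $N$. As the inverse of a
non-singular matrix is unique, the parameters of $H_2^{-1}$ can be calculated for
$a_i^2 \neq c_i$, $i = 1, 2, \ldots, N$, and $k \neq 0$ with

$$k = \sum_{j=1}^{N} \frac{a_j^2}{a_j^2 - c_j} - 1, \qquad (A.18)$$

$$b_i = \frac{a_i}{k\,(a_i^2 - c_i)}, \qquad (A.19)$$

$$d_i = b_i^2 - \frac{1}{k\,(a_i^2 - c_i)}. \qquad (A.20)$$

A simplified expression for the inverse reads

$$H_2^{-1} = \frac{b b^T}{\sum_{j=1}^{N} \frac{a_j^2}{a_j^2 - c_j} - 1} - \operatorname{diag}\left(\frac{1}{a_i^2 - c_i}\right) \quad \text{with } b_i = \frac{a_i}{a_i^2 - c_i}. \qquad (A.21)$$

This solution for $H_2^{-1}$ is only valid for $a_i^2 \neq c_i$ and $k \neq 0$.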
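The simplified inverse of $H_2 = a a^T + \operatorname{diag}(c_i - a_i^2)$ has the form of a rank-one update of a diagonal matrix, so it can be cross-checked against a direct multiplication. The sketch below assumes the closed form $b b^T / (\sum_j a_j^2/(a_j^2 - c_j) - 1) - \operatorname{diag}(1/(a_i^2 - c_i))$ with $b_i = a_i/(a_i^2 - c_i)$, which is what the Sherman-Morrison identity yields for this structure; the helper names and the sample values are hypothetical.

```python
from fractions import Fraction

def h2(a, c):
    """Matrix with c_i on the diagonal and a_i * a_j off the diagonal."""
    n = len(a)
    return [[Fraction(c[i]) if i == j else Fraction(a[i]) * Fraction(a[j])
             for j in range(n)] for i in range(n)]

def h2_inverse(a, c):
    """Closed-form inverse; requires a_i^2 != c_i and a nonzero denominator."""
    a = [Fraction(x) for x in a]
    c = [Fraction(x) for x in c]
    e = [ai * ai - ci for ai, ci in zip(a, c)]           # a_i^2 - c_i
    if any(ei == 0 for ei in e):
        raise ValueError("requires a_i^2 != c_i for all i")
    s = sum(ai * ai / ei for ai, ei in zip(a, e)) - 1     # sum_j a_j^2/(a_j^2 - c_j) - 1
    if s == 0:
        raise ValueError("H2 is singular")
    b = [ai / ei for ai, ei in zip(a, e)]                 # b_i = a_i / (a_i^2 - c_i)
    n = len(a)
    return [[b[i] * b[j] / s - (1 / e[i] if i == j else 0) for j in range(n)]
            for i in range(n)]

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]

def is_identity(M):
    n = len(M)
    return all(M[i][j] == (1 if i == j else 0) for i in range(n) for j in range(n))

if __name__ == "__main__":
    a, c = [1, 2, 3], [3, 5, 7]
    print(is_identity(matmul(h2(a, c), h2_inverse(a, c))))
```

Setting all $a_i = 1$ in this sketch gives the special case with ones off the diagonal and individual diagonal entries $c_i$.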

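The special case $a = \mathbf{1}$ is also of interest for the hypothesis power spectrum matrix.
$H_3$ reads

$$H_3 = \begin{pmatrix} c_1 & 1 & \cdots & 1 \\ 1 & c_2 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & c_N \end{pmatrix} \qquad (A.22)$$

and its inverse can be calculated by

$$H_3^{-1} = \frac{b b^T}{\sum_{j=1}^{N} \frac{1}{1 - c_j} - 1} - \operatorname{diag}\left(\frac{1}{1 - c_i}\right) \quad \text{with } b_i = \frac{1}{1 - c_i}. \qquad (A.23)$$

This solution holds only for $c_i \neq 1$, $i = 1, 2, \ldots, N$, and $\sum_{j=1}^{N} \frac{1}{1 - c_j} \neq 1$.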
A.4 Power Spectral Density of a Frame

The 2D continuous-space Fourier transform $\mathcal{F}\{\cdot\}$ of the 2D continuous signal $v(x, y)$ in
Cartesian coordinates with $x, y \in \mathbb{R}$ is defined by

$$V(\omega_x, \omega_y) = \int_{\mathbb{R}^2} v(x, y)\, e^{-j(\omega_x x + \omega_y y)}\, dx\, dy. \qquad (A.24)$$

The 2D continuous-space Fourier transform of the 2D continuous signal $v(r, \theta)$ in cylindrical
coordinates is given by

$$V(\omega_r, \omega_\theta) = \int_0^{2\pi}\!\!\int_0^{\infty} v(r, \theta)\, e^{-j r \omega_r \cos(\theta - \omega_\theta)}\, r\, dr\, d\theta \qquad (A.25)$$

where

$$x = r \cos\theta, \qquad \omega_x = \omega_r \cos\omega_\theta, \qquad (A.26)$$

$$y = r \sin\theta, \qquad \omega_y = \omega_r \sin\omega_\theta. \qquad (A.27)$$



The Fourier transform of the isotropic, exponentially decaying, space-continuous function

$$v(r, \theta) = e^{-\omega_0 r} \quad \forall \theta$$

with $\omega_0 > 0$ is also isotropic.
dx, x = 1 + y'f2cos(#)(A.32)
(A.33) 
(A. 34) 



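The integral in (A.32) is solved in [229, 230]. In Cartesian coordinates, the Fourier transform
of the function $v(x, y) = e^{-\omega_0 \sqrt{x^2 + y^2}}$ reads

$$V(\omega_x, \omega_y) = \frac{2\pi}{\omega_0^2} \left(1 + \frac{\omega_x^2 + \omega_y^2}{\omega_0^2}\right)^{-3/2}. \qquad (A.35)$$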
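For orientation, the closed form of the transform of $e^{-\omega_0 r}$ can be reached from the polar-coordinate Fourier integral as sketched below. This is one possible route under the stated steps; it is not claimed to reproduce the intermediate equations or the substitution of the original derivation. The angular integral is a standard result for real $A > |B|$ and is used here with $A = \omega_0$ and $B = j\omega_r$, which is an assumption justified by analytic continuation.

```latex
\[
\begin{aligned}
V(\omega_r)
 &= \int_0^{2\pi}\!\!\int_0^{\infty} e^{-\omega_0 r}\, e^{-j r \omega_r \cos\varphi}\, r \, dr \, d\varphi
    && \text{with } \varphi = \theta - \omega_\theta \\
 &= \int_0^{2\pi} \frac{d\varphi}{(\omega_0 + j\omega_r\cos\varphi)^2}
    && \text{since } \int_0^\infty r\,e^{-sr}\,dr = s^{-2},\ \operatorname{Re} s > 0 \\
 &= \frac{2\pi\,\omega_0}{(\omega_0^2 + \omega_r^2)^{3/2}}
    && \text{using } \int_0^{2\pi}\frac{d\varphi}{(A + B\cos\varphi)^2}
       = \frac{2\pi A}{(A^2 - B^2)^{3/2}} \\
 &= \frac{2\pi}{\omega_0^2}\Bigl(1 + \frac{\omega_r^2}{\omega_0^2}\Bigr)^{-3/2},
    \qquad \omega_r^2 = \omega_x^2 + \omega_y^2 .
\end{aligned}
\]
```

The result depends on $\omega_r$ only, which confirms that the transform of the isotropic function is itself isotropic.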




Glossary 



Acronyms 




BL          Baseline
CABAC       Context-Based Adaptive Binary Arithmetic Coding
CIF         Common Intermediate Format
DCT         Discrete Cosine Transform
DHT         Dyadic Haar Transform
D5/3T       Dyadic 5/3 Transform
ECVQ        Entropy Constrained Vector Quantization
GOP         Group of Pictures
HP          Half-Pel
INTER4V     Inter-prediction mode with four 8x8 blocks
INTER2H     Inter-prediction mode with two hypotheses
INTER4H     Inter-prediction mode with four hypotheses
INTER4VMH   Inter-prediction mode with four 8x8 superimposed blocks
IP          Integer-Pel
ITU         International Telecommunication Union
ITU-T       ITU Telecommunication Standardization Sector
KLT         Karhunen-Loeve Transform
MCP         Motion-Compensated Prediction
MHP         Superimposed Prediction
MPEG        Moving Picture Experts Group
OBMC        Overlapped Block Motion Compensation
PDF         Probability Density Function
PSNR        Peak Signal to Noise Ratio
QCIF        Quarter Common Intermediate Format
RNL         Residual Noise Level
TML-9       H.26L Test Model Software Version 9
UVLC        Universal Variable Length Coding
VBS         Variable Block Size
VCEG        Video Coding Experts Group
VQ          Vector Quantization


Probability Theory 


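$\mathbb{R}$               Set of real numbers
$\Pi$                      Two-dimensional orthogonal unit grid
$\mathbf{a}$               Random variable, process, or field
$\Pr\{\mathbf{a} < a\}$    Probability of the event $\{\mathbf{a} < a\}$
$R_{\mathbf{a}}(a)$        Reliability function of $\mathbf{a}$
$p_{\mathbf{a}}(a)$        PDF of $\mathbf{a}$
$\sigma_{\mathbf{a}}^2$    Variance of $\mathbf{a}$
$\rho_{\mathbf{a}}$        Correlation coefficient associated with $\mathbf{a}$
$\phi_{ab}(\cdot)$         Correlation function between a and b
$\Phi_{ab}(\cdot)$         Cross spectral density of a and b
$E\{\cdot\}$               Expectation operator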


Matrix Algebra 


$\mathbf{1}$               Column vector with all entries equal to one
$\|\cdot\|_2^2$            Square norm of a vector
$|\cdot|$                  Length of a code word vector
det(·)                     Determinant of a matrix
diag(·)                    Diagonal matrix
$I$                        Identity matrix
$^*$                       Complex conjugate operator
$^H$                       Hermitian conjugate operator
$^T$                       Transpose operator

Transforms

FA-)                       2D band-limited discrete-space Fourier transform
$\mathcal{F}\{\cdot\}$     2D continuous-space Fourier transform